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Abstract 



This paper addresses the problem of planning under uncertainty in large Markov Decision 
Processes (MDPs). Factored MDPs represent a complex state space using state variables and 
the transition model using a dynamic Bayesian network. This representation often allows an 
exponential reduction in the representation size of structured MDPs, but the complexity of exact 
solution algorithms for such MDPs can grow exponentially in the representation size. In this paper, 
we present two approximate solution algorithms that exploit structure in factored MDPs. Both 
use an approximate value function represented as a linear combination of basis functions, where 
each basis function involves only a small subset of the domain variables. A key contribution of this 
paper is that it shows how the basic operations of both algorithms can be performed efficiently 
in closed form, by exploiting both additive and context- specific structure in a factored MDP. A 
central element of our algorithms is a novel linear program decomposition technique, analogous to 
variable elimination in Bayesian networks, which reduces an exponentially large LP to a provably 
equivalent, polynomial-sized one. One algorithm uses approximate linear programming, and the 
second approximate dynamic programming. Our dynamic programming algorithm is novel in that 
it uses an approximation based on max-norm, a technique that more directly minimizes the terms 
that appear in error bounds for approximate MDP algorithms. We provide experimental results 
on problems with over 10 40 states, demonstrating a promising indication of the scalability of our 
approach, and compare our algorithm to an existing state-of-the-art approach, showing, in some 
problems, exponential gains in computation time. 

1. Introduction 

Over the last few years, Markov Decision Processes (MDPs) have been used as the basic 
semantics for optimal planning for decision theoretic agents in stochastic environments. In 
the MDP framework, the system is modeled via a set of states which evolve stochastically. 
The main problem with this representation is that, in virtually any real-life domain, the 
state space is quite large. However, many large MDPs have significant internal structure, 
and can be modeled compactly if the structure is exploited in the representation. 

Factored MDPs (Boutilier, Dearden, & Goldszmidt, 2000) are one approach to repre- 
senting large, structured MDPs compactly. In this framework, a state is implicitly described 
by an assignment to some set of state variables. A dynamic Bayesian network (DBN) (Dean 
& Kanazawa, 1989) can then allow a compact representation of the transition model, by 
exploiting the fact that the transition of a variable often depends only on a small number 
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of other variables. Furthermore, the momentary rewards can often also be decomposed as 
a sum of rewards related to individual variables or small clusters of variables. 

There are two main types of structure that can simultaneously be exploited in factored 
MDPs: additive and context-specific structure. Additive structure captures the fact that 
typical large-scale systems can often be decomposed into a combination of locally inter- 
acting components. For example, consider the management of a large factory with many 
production cells. Of course, in the long run, if a cell positioned early in the production line 
generates faulty parts, then the whole factory may be affected. However, the quality of the 
parts a cell generates depends directly only on the state of this cell and the quality of the 
parts it receives from neighboring cells. Such additive structure can also be present in the 
reward function. For example, the cost of running the factory depends, among other things, 
on the sum of the costs of maintaining each local cell. 

Context- specific structure encodes a different type of locality of influence: Although a 
part of a large system may, in general, be influenced by the state of every other part of this 
system, at any given point in time only a small number of parts may influence it directly. 
In our factory example, a cell responsible for anodization may receive parts directly from 
any other cell in the factory. However, a work order for a cylindrical part may restrict this 
dependency only to cells that have a lathe. Thus, in the context of producing cylindrical 
parts, the quality of the anodized parts depends directly only on the state of cells with a 
lathe. 

Even when a large MDP can be represented compactly, for example, by using a factored 
representation, solving it exactly may still be intractable: Typical exact MDP solution al- 
gorithms require the manipulation of a value function, whose representation is linear in the 
number of states, which is exponential in the number of state variables. One approach is 
to approximate the solution using an approximate value function with a compact represen- 
tation. A common choice is the use of linear value functions as an approximation — value 
functions that are a linear combination of potentially non-linear basis functions (Bellman, 
Kalaba, & Kotkin, 1963; Sutton, 1988; Tsitsiklis & Van Roy, 1996b). Our work builds on 
the ideas of Koller and Parr (1999, 2000), by using factored (linear) value functions, where 
each basis function is restricted to some small subset of the domain variables. 

This paper presents two new algorithms for computing linear value function approxi- 
mations for factored MDPs: one that uses approximate dynamic programming and another 
that uses approximate linear programming. Both algorithms are based on the use of fac- 
tored linear value functions, a highly expressive function approximation method. This 
representation allows the algorithms to take advantage of both additive and context-specific 
structure, in order to produce high-quality approximate solutions very efficiently. The ca- 
pability to exploit both types of structure distinguishes these algorithms differ from earlier 
approaches (Boutilier et al, 2000), which only exploit context-specific structure. We provide 
a more detailed discussion of the differences in Section 10. 

We show that, for a factored MDP and factored value functions, various critical oper- 
ations for our planning algorithms can be implemented in closed form without necessarily 
enumerating the entire state space. In particular, both our new algorithms build upon a 
novel linear programming decomposition technique. This technique reduces structured LPs 
with exponentially many constraints to equivalent, polynomially-sized ones. This decompo- 
sition follows a procedure analogous to variable elimination that applies both to additively 
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structured value functions (Bertele & Brioschi, 1972) and to value functions that also ex- 
ploit context-specific structure (Zhang & Poole, 1999). Using these basic operations, our 
planning algorithms can be implemented efficiently, even though the size of the state space 
grows exponentially in the number of variables. 

Our first method is based on the approximate linear programming algorithm (Schweitzer 
& Seidmann, 1985). This algorithm generates a linear, approximate value function by 
solving a single linear program. Unfortunately, the number of constraints in the LP proposed 
by Schweitzer and Seidmann grows exponentially in the number of variables. Using our LP 
decomposition technique, we exploit structure in factored MDPs to represent exactly the 
same optimization problem with exponentially fewer constraints. 

In terms of approximate dynamic programming, this paper makes a twofold contribution. 
First, we provide a new approach for approximately solving MDPs using a linear value 
function. Previous approaches to linear function approximation typically have utilized a 
least squares (^2-norm) approximation to the value function. Least squares approximations 
are incompatible with most convergence analyses for MDPs, which are based on max-norm. 
We provide the first MDP solution algorithms — both value iteration and policy iteration — 
that use a linear max-norm projection to approximate the value function, thereby directly 
optimizing the quantity that appears in our provided error bounds. Second, we show how 
to exploit the structure of the problem to apply this technique to factored MDPs, by again 
leveraging on our LP decomposition technique. 

Although approximate dynamic programming currently possesses stronger theoretical 
guarantees, our experimental results suggest that approximate linear programming is a 
good alternative. Whereas the former tends to generate better policies for the same set of 
basis functions, due to the simplicity and computational advantages of approximate linear 
programming, we can add more basis functions, obtaining a better policy and still requiring 
less computation than the approximate dynamic programming approach. 

Finally, we present experimental results comparing our approach to the work of Boutilier 
et al. (2000), illustrating some of the tradeoffs between the two methods. In particular, for 
problems with significant context-specific structure in the value function, their approach 
can be faster due to their efficient handling of their value function representation. However, 
there are cases with significant context-specific structure in the problem, rather than in 
the value function, in which their algorithm requires an exponentially large value function 
representation. In such classes of problems, we demonstrate that by using a value func- 
tion that exploits both additive and context-specific structure, our algorithm can obtain a 
polynomial-time near-optimal approximation of the true value function. 

This paper starts with a presentation of factored MDPs and approximate solution al- 
gorithms for MDPs. In Section 4, we describe the basic operations used in our algorithms, 
including our LP decomposition technique. In Section 5, we present the first of our two 
algorithms: the approximate linear programming algorithm for factored MDPs. The second 
algorithm, approximate policy iteration with max-norm projection, is presented in Section 6. 
Section 7 describes an approach for efficiently computing bounds on policy quality based on 
the Bellman error. Section 8 shows how to extend our methods to deal with context-specific 
structure. Our paper concludes with an empirical evaluation in Section 9 and a discussion 
of related work in Section 10. 
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This paper is a greatly expanded version of work that was published before in Guestrin 
et al. (2001a), and some of the work presented in Guestrin et al. (2001b, 2002). 

2. Factored Markov Decision Processes 

A Markov decision process (MDP) is a mathematical framework for sequential decision 
problems in stochastic domains. It thus provides an underlying semantics for the task of 
planning under uncertainty. We begin with a concise overview of the MDP framework, and 
then describe the representation of factored MDPs. 

2.1 Markov Decision Processes 

We briefly review the MDP framework, referring the reader to the books by Bertsekas and 
Tsitsiklis (1996) or Puterman (1994) for a more in-depth review. A Markov Decision Process 
(MDP) is defined as a 4-tuple (X, A, R, P) where: X is a finite set of |X| = N states; A is 
a finite set of actions; R is a reward function R : X x A i— > R, such that P(x, a) represents 
the reward obtained by the agent in state x after taking action a; and P is a Markovian 
transition model where P(x' | x, a) represents the probability of going from state x to state 
x' with action a. We assume that the rewards are bounded, that is, there exists R m ax such 
that R max > |P(x, a) | , Vx,a. 

Example 2.1 Consider the problem of optimizing the behavior of a system administrator 
(SysAdmin) maintaining a network of m computers. In this network, each machine is 
connected to some subset of the other machines. Various possible network topologies can be 
defined in this manner (see Figure 1 for some examples) . In one simple network, we might 
connect the machines in a ring, with machine i connected to machines i + 1 and i — (In 
this example, we assume addition and subtraction are performed modulo m.) 

Each machine is associated with a binary random variable Xi, representing whether it 
is working or has failed. At every time step, the SysAdmin receives a certain amount of 
money (reward) for each working machine. The job of the SysAdmin is to decide which 
machine to reboot; thus, there are m + 1 possible actions at each time step: reboot one of the 
m machines or do nothing (only one machine can be rebooted per time step). If a machine 
is rebooted, it will be working with high probability at the next time step. Every machine 
has a small probability of failing at each time step. However, if a neighboring machine fails, 
this probability increases dramatically. These failure probabilities define the transition model 
P(x' | x, a), where x is a particular assignment describing which machines are working or 
have failed in the current time step, a is the SysAdmin's choice of machine to reboot and x' 
is the resulting state in the next time step. □ 

We assume that the MDP has an infinite horizon and that future rewards are discounted 
exponentially with a discount factor 7 € [0,1). A stationary policy n for an MDP is a 
mapping ir : X 1— > A, where 7r(x) is the action the agent takes at state x. In the computer 
network problem, for each possible configuration of working and failing machines, the policy 
would tell the SysAdmin which machine to reboot. Each policy is associated with a value 
function W £ M, N , where Vtt(x) is the discounted cumulative value that the agent gets if 
it starts at state x and follows policy ir. More precisely, the value V w of a state x under 
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Figure 1: Network topologies tested; the status of a machine is influence by the status of 
its parent in the network. 



policy 7r is given by: 



K(x) = E % 



Xy*(xw tt(x<*>)) 



t=0 



x(°) = 



X 



where is a random variable representing the state of the system after t steps. In our 
running example, the value function represents how much money the SysAdmin expects to 
collect if she starts acting according to ir when the network is at state x. The value function 
for a fixed policy is the fixed point of a set of linear equations that define the value of a 
state in terms of the value of its possible successor states. More formally, we define: 

Definition 2.2 The DP operator, for a stationary policy it is: 

7;V(x) = 2?(x,tt(x)) + 7 ^P(x' I x,7r(x))V(x'). 

x' 

The value function of policy ir, V w , is the fixed point of the T n operator: V n = T n V n . □ 

The optimal value function V* describes the optimal value the agent can achieve for 
each starting state. V* is also defined by a set of non-linear equations. In this case, the 
value of a state must be the maximal expected value achievable by any policy starting at 
that state. More precisely, we define: 

Definition 2.3 The Bellman operator, T* , is: 

T*V(x) =max[i?(x,a) + 7 ^P(x' | x,a)V(x')]. 

° x' 

The optimal value function V* is the fixed point of T* : V* = T*V* . □ 

For any value function V, we can define the policy obtained by acting greedily relative 
to V. In other words, at each state, the agent takes the action that maximizes the one-step 



403 



GUESTRIN, KOLLER, PARR & VENKATARAMAN 



utility, assuming that V represents our long-term utility achieved at the next state. More 
precisely, we define: 



The greedy policy relative to the optimal value function V* is the optimal policy ir* = 
Greed y(V*). 

2.2 Factored MDPs 

Factored MDPs are a representation language that allows us to exploit problem structure 
to represent exponentially large MDPs very compactly. The idea of representing a large 
MDP using a factored model was first proposed by Boutilier et al. (1995). 

In a factored MDP, the set of states is described via a set of random variables X = 
{Xi, . . . ,X n }, where each X{ takes on values in some finite domain Dom(JQ). A state x 
defines a value Xj G Dom(Xj) for each variable JQ. In general, we use upper case letters 
(e.g., X) to denote random variables, and lower case (e.g., x) to denote their values. We 
use boldface to denote vectors of variables (e.g., X) or their values (x). For an instantiation 
y € Dom(Y) and a subset of these variables Z C Y, we use y[Z] to denote the value of the 
variables Z in the instantiation y. 

In a factored MDP, we define a state transition model r using a dynamic Bayesian 
network (DBN) (Dean & Kanazawa, 1989). Let Xi denote the variable Xi at the current 
time and X[, the same variable at the next step. The transition graph of a DBN is a 
two-layer directed acyclic graph G T whose nodes are {X±, . . . , X n , X[, . . . , X' n }. We denote 
the parents of X[ in the graph by Parents T (X' i ). For simplicity of exposition, we assume 
that Parents T (X' i ) C X; thus, all arcs in the DBN are between variables in consecutive 
time slices. (This assumption is used for expository purposes only; intra-time slice arcs 
are handled by a small modification presented in Section 4.1.) Each node X[ is associated 
with a conditional 'probability distribution (CPD) P T (X' i \ Parents T (X' i )). The transition 
probability P T (x' | x) is then defined to be: 



where Uj is the value in x of the variables in Parents T (X' i ). 

Example 2.4 Consider an instance of the Sys Admin problem with four computers, labelled 
Mi, . . . , M4, in an unidirectional ring topology as shown in Figure 2(a). Our first task in 
modeling this problem as a factored MDP is to define the state space X. Each machine 
is associated with a binary random variable Xi, representing whether it is working or has 
failed. Thus, our state space is represented by four random variables: {Xi, Xi, X3, -X4}. 
The next task is to define the transition model, represented as a DBN. The parents of the 
next time step variables X[ depend on the network topology. Specifically, the probability that 
machine i will fail at the next time step depends on whether it is working at the current 
time step and on the status of its direct neighbors (parents in the topology) in the network 
at the current time step. As shown in Figure 2(b), the parents of X[ in this example are Xi 
and Xi-i. The CPD of X- is such that if Xi = false, then X[ = false with high probability; 




(1) 



P r (x' |x) = JJP r (^ |Ui) , 
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that is, failures tend to persist. If X, = true, then X- is a noisy or of its other parents ( in 
the unidirectional ring topology X[ has only one other parent X^-i); that is, a failure in any 
of its neighbors can independently cause machine i to fail. □ 

We have described how to represent factored the Markovian transition dynamics arising 
from an MDP as a DBN, but we have not directly addressed the representation of actions. 
Generally, we can define the transition dynamics of an MDP by defining a separate DBN 
model T a = (G a ,P a ) for each action a. 

Example 2.5 In our system administrator example, we have an action ai for rebooting 
each one of the machines, and a default action d for doing nothing. The transition model 
described above corresponds to the "do nothing" action. The transition model for dj is 
different from d only in the transition model for the variable X[, which is now X- = true 
with probability one, regardless of the status of the neighboring machines. Figure 2(c) shows 
the actual CPD for P(X' i = Working \ Xi, Xi_i, A), with one entry for each assignment to 
the state variables X-i and Xi_\, and to the action A. □ 

To fully specify an MDP, we also need to provide a compact representation of the reward 
function. We assume that the reward function is factored additively into a set of localized 
reward functions, each of which only depends on a small set of variables. In our example, we 
might have a reward function associated with each machine i, which depends on Xj. That 
is, the SysAdmin is paid on a per-machine basis: at every time step, she receives money for 
machine i only if it is working. We can formalize this concept of localized functions: 

Definition 2.6 A function f has a scope Scope[/] = C C X if f : Dom(C) i-> It. □ 

If / has scope Y and Y C Z, we use /(z) as shorthand for /(y) where y is the part of the 
instantiation z that corresponds to variables in Y. 
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We can now characterize the concept of local rewards. Let Rf, be a set of 

functions, where the scope of each Rf is restricted to variable cluster C {X±, • • • ,X n }. 
The reward for taking action a at state x is defined to be i? a (x) = Ya=i Rf (Uf ) € R- In 
our example, we have a reward function Ri associated with each machine i, which depends 
only Xi, and does not depend on the action choice. These local rewards are represented 
by the diamonds in Figure 2(b), in the usual notation for influence diagrams (Howard &; 
Matheson, 1984). 

3. Approximate Solution Algorithms 

There are several algorithms to compute the optimal policy in an MDP. The three most 
commonly used are value iteration, policy iteration, and linear programming. A key compo- 
nent in all three algorithms is the computation of value functions, as defined in Section 2.1. 
Recall that a value function defines a value for each state x in the state space. With an 
explicit representation of the value function as a vector of values for the different states, 
the solution algorithms all can be implemented as a series of simple algebraic steps. Thus, 
in this case, all three can be implemented very efficiently. 

Unfortunately, in the case of factored MDPs, the state space is exponential in the number 
of variables in the domain. In the Sys Admin problem, for example, the state x of the system 
is an assignment describing which machines are working or have failed; that is, a state x 
is an assignment to each random variable X^. Thus, the number of states is exponential in 
the number m of machines in the network (|X| = N = 2 m ). Hence, even representing an 
explicit value function in problems with more than about ten machines is infeasible. One 
might be tempted to believe that factored transition dynamics and rewards would result in 
a factored value function, which can thereby be represented compactly. Unfortunately, even 
in trivial factored MDPs, there is no guarantee that structure in the model is preserved in 
the value function (Koller & Parr, 1999). 

In this section, we discuss the use of an approximate value function, that admits a 
compact representation. We also describe approximate versions of these exact algorithms, 
that use approximate value functions. Our description in this section is somewhat abstract, 
and does not specify how the basic operations required by the algorithms can be performed 
explicitly. In later sections, we elaborate on these issues, and describe the algorithms in 
detail. For brevity, we choose to focus on policy iteration and linear programming; our 
techniques easily extend to value iteration. 

3.1 Linear Value Functions 

A very popular choice for approximating value functions is by using linear regression, as first 
proposed by Bellman et al. (1963). Here, we define our space of allowable value functions 
VeWC via a set of basis functions: 

Definition 3.1 A linear value function over a set of basis functions H = {hi, . . . , h^} 
is a function V that can be written as V(x) = J2j=i w j ^j( x ) f or some coefficients w = 

(wi,...,w k y. □ 

We can now define TL to be the linear subspace of R^ spanned by the basis functions H. 
It is useful to define an N x k matrix H whose columns are the k basis functions viewed as 



406 



Efficient Solution Algorithms for Factored MDPs 



vectors. In a more compact notation, our approximate value function is then represented 
by Hw. 

The expressive power of this linear representation is equivalent, for example, to that 
of a single layer neural network with features corresponding to the basis functions defining 
H. Once the features are defined, we must optimize the coefficients w in order to obtain a 
good approximation for the true value function. We can view this approach as separating 
the problem of defining a reasonable space of features and the induced space 7i, from the 
problem of searching within the space. The former problem is typically the purview of 
domain experts, while the latter is the focus of analysis and algorithmic design. Clearly, 
feature selection is an important issue for essentially all areas of learning and approximation. 
We offer some simple methods for selecting good features for MDPs in Section 11, but it is 
not our goal to address this large and important topic in this paper. 

Once we have a chosen a linear value function representation and a set of basis functions, 
the problem becomes one of finding values for the weights w such that Hw will yield 
a good approximation of the true value function. In this paper, we consider two such 
approaches: approximate dynamic programming using policy iteration and approximate 
linear programming. In this section, we present these two approaches. In Section 4, we 
show how we can exploit problem structure to transform these approaches into practical 
algorithms that can deal with exponentially large state spaces. 

3.2 Policy Iteration 

3.2.1 The Exact Algorithm 

The exact policy iteration algorithm iterates over policies, producing an improved policy at 
each iteration. Starting with some initial policy tt^°\ each iteration consists of two phases. 
Value determination computes, for a policy tt^\ the value function V n (t), by finding the 
fixed point of the equation T^V^t) = V n (t), that is, the unique solution to the set of linear 
equations: 

V wW (x) = i?(x,7r«(x)) + 7 E P ( X ' I x,7r(*)(x))V 7r(t) (x / ),Vx. 

x' 

The policy improvement step defines the next policy as 

n (t+i) = GreedyiVn(t) ). 

It can be shown that this process converges to the optimal policy (Bertsekas &: Tsitsiklis, 
1996). Furthermore, in practice, the convergence to the optimal policy is often very quick. 

3.2.2 Approximate Policy Iteration 

The steps in the policy iteration algorithm require a manipulation of both value functions 
and policies, both of which often cannot be represented explicitly in large MDPs. To define 
a version of the policy iteration algorithm that uses approximate value functions, we use 
the following basic idea: We restrict the algorithm to using only value functions within the 
provided Tt; whenever the algorithm takes a step that results in a value function V that is 
outside this space, we project the result back into the space by finding the value function 
within the space which is closest to V. More precisely: 
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Definition 3.2 A projection operator II is a mapping II : H N — ► Ti.. II is said to be a 
projection w.r.t. a norm ||-|| if UV = Hw* such that w* £ argmin w ||Hw — V||. □ 

That is, IIV is the linear combination of the basis functions, that is closest to V with respect 
to the chosen norm. 

Our approximate policy iteration algorithm performs the policy improvement step ex- 
actly. In the value determination step, the value function — the value of acting according to 
the current policy 7T« — is approximated through a linear combination of basis functions. 

We now consider the problem of value determination for a policy ir^. At this point, 
it is useful to introduce some notation: Although the rewards are a function of the state 
and action choice, once the policy is fixed, the rewards become a function of the state 
only, which we denote as R n (t), where R n ( t )(x) = P(x, 7r^(x)). Similarly, for the transition 
model: P 7r (t)(x / j x) = P(x' | x, 7i"(*)(x)). We can now rewrite the value determination step 
in terms of matrices and vectors. If we view V n ( t ) and R n (t) as V-vectors, and P n ( t) as an 
N x N matrix, we have the equations: 

V 7r(t) = R n ( t) +jP 7T (t ) V 7T (t). 

This is a system of linear equations with one equation for each state, which can only be 
solved exactly for relatively small N. Our goal is to provide an approximate solution, within 
TL. More precisely, we want to find: 

w (t) _ arg mi n || Hw - (R^t) + 7P 7r ( t )Hw)|| ; 

(H-7P w( t)H)wW-^ (t ) 
Thus, our approximate policy iteration alternates between two steps: 

w (t> = argmin||Hw-(P^ (t) + 7 P^ (t) Hw)||; (2) 

W 

n (t+i) = Greedy(HwW). (3) 



arg mm 

w 



3.2.3 Max-norm Projection 

An approach along the lines described above has been used in various papers, with several 
recent theoretical and algorithmic results (Schweitzer & Seidmann, 1985; Tsitsiklis & Van 
Roy, 1996b; Van Roy, 1998; Koller & Parr, 1999, 2000). However, these approaches suffer 
from a problem that we might call "norm incompatibility." When computing the projection, 
they utilize the standard Euclidean projection operator with respect to the Li norm or a 
weighted C2 norm. 1 On the other hand, most of the convergence and error analyses for MDP 
algorithms utilize max-norm (£00 )• This incompatibility has made it difficult to provide 
error guarantees. 

We can tie the projection operator more closely to the error bounds through the use 
of a projection operator in norm. The problem of minimizing the norm has been 
studied in the optimization literature as the problem of finding the Chebyshev solution 2 to 

1. Weighted £2 norm projections are stable and have meaningful error bounds when the weights correspond 
to the stationary distribution of a fixed policy under evaluation (value determination) (Van Roy, 1998), 
but they are not stable when combined with T* . Averagers (Gordon, 1995) are stable and non-expansive 
in Coo, but require that the mixture weights be determined a priori. Thus, they do not, in general, 
minimize error. 

2. The Chebyshev norm is also referred to as max, supremum and Caa norms and the minimax solution. 
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an overdetermined linear system of equations (Cheney, 1982). The problem is defined as 
finding w* such that: 

w* <G argmin IICw - bll . (4) 

We use an algorithm due to Stiefel (1960), that solves this problem by linear program- 
ming: 

Variables: w\, . . . , w^, (p ; 
Minimize: <p ; 



Subject to: cp > Y?j=i c ij w j ~ h and 

<P > k - J2j=i CijWj, i = 1...N. 



(5) 



The constraints in this linear program imply that <p > J2j=i c ij w j ~ hi for each i, or 
equivalently, that <p > ||Cw — bjl^. The objective of the LP is to minimize <p. Thus, at the 
solution (w*, <p*) of this linear program, w* is the solution of Equation (4) and <p is the 
projection error. 

We can use the projection in the context of the approximate policy iteration in the 
obvious way. When implementing the projection operation of Equation (2), we can use 
the Coo projection (as in Equation (4)), where C = (H — 7P 7r ( t )H) and b = R n (t). This 
minimization can be solved using the linear program of (5). 

A key point is that this LP only has k + 1 variables. However, there are 2N constraints, 
which makes it impractical for large state spaces. In the SysAdmin problem, for example, 
the number of constraints in this LP is exponential in the number of machines in the network 
(a total of 2 • 2 m constraints for m machines). In Section 4, we show that, in factored MDPs 
with linear value functions, all the 2N constraints can be represented efficiently, leading to 
a tractable algorithm. 

3.2.4 Error Analysis 

We motivated our use of the max-norm projection within the approximate policy iteration 
algorithm via its compatibility with standard error analysis techniques for MDP algorithms. 
We now provide a careful analysis of the impact of the error introduced by the projec- 
tion step. The analysis provides motivation for the use of a projection step that directly 
minimizes this quantity. We acknowledge, however, that the main impact of this analysis 
is motivational. In practice, we cannot provide a priori guarantees that an projection 
will outperform other methods. 

Our goal is to analyze approximate policy iteration in terms of the amount of error 
introduced at each step by the projection operation. If the error is zero, then we are 
performing exact value determination, and no error should accrue. If the error is small, we 
should get an approximation that is accurate. This result follows from the analysis below. 
More precisely, we define the projection error as the error resulting from the approximate 
value determination step: 

/3« = HwW - f^ (t) + 7 P T(t) HwW) . 

V / oo 

Note that, by using our max-norm projection, we are finding the set of weights w^) that 
exactly minimizes the one-step projection error /?(*). That is, we are choosing the best 



409 



GUESTRIN, KOLLER, PARR & VENKATARAMAN 



possible weights with respect to this error measure. Furthermore, this is exactly the error 
measure that is going to appear in the bounds of our theorem. Thus, we can now make the 
bounds for each step as tight as possible. 

We first show that the projection error accrued in each step is bounded: 

Lemma 3.3 The value determination error is bounded: There exists a constant f3p < R ma x 
such that j3p > (3^ for all iterations t of the algorithm. 

Proof: See Appendix A.l. □ 

Due to the contraction property of the Bellman operator, the overall accumulated error 
is a decaying average of the projection error incurred throughout all iterations: 

Definition 3.4 The discounted value determination error at iteration t is defined as: (3® = 

Lemma 3.3 implies that the accumulated error remains bounded in approximate policy 
iteration: p {t) < We can now bound the loss incurred when acting according 

to the policy generated by our approximate policy iteration algorithm, as opposed to the 
optimal policy: 

Theorem 3.5 In the approximate policy iteration algorithm, let ir^ be the policy generated 
at iteration t. Furthermore, let V n (t) be the actual value of acting according to this policy. 
The loss incurred by using policy ir® as opposed to the optimal policy ir* with value V* is 
bounded by: 

II v* - v wW iu < 7 * || v* - v^o IL + -^— 2 . (6) 



Proof: See Appendix A. 2. □ 

In words, Equation (6) shows that the difference between our approximation at iteration 
t and the optimal value function is bounded by the sum of two terms. The first term is 
present in standard policy iteration and goes to zero exponentially fast. The second is the 
discounted accumulated projection error and, as Lemma 3.3 shows, is bounded. This second 
term can be minimized by choosing as the one that minimizes: 



Hw«-(^ (t)+7 ^ (t) HwW)| 



which is exactly the computation performed by the max-norm projection. Therefore, this 
theorem motivates the use of max-norm projections to minimize the error term that appears 
in our bound. 

The bounds we have provided so far may seem fairly trivial, as we have not provided 
a strong a priori bound on /?(*). Fortunately, several factors make these bounds interest- 
ing despite the lack of a priori guarantees. If approximate policy iteration converges, as 
occurred in all of our experiments, we can obtain a much tighter bound: If tt is the policy 
after convergence, then: 

^Hoc - (1_ 7 )' 

where /fc is the one-step max-norm projection error associated with estimating the value 
of 7f. Since the max-norm projection operation provides (3^, we can easily obtain an a 
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posteriori bound as part of the policy iteration procedure. More details are provided in 
Section 7. 

One could rewrite the bound in Theorem 3.5 in terms of the worst case projection er- 
ror (3p, or the worst projection error in a cycle of policies, if approximate policy iteration 
gets stuck in a cycle. These formulations would be closer to the analysis of Bertsekas and 
Tsitsiklis (1996, Proposition 6.2, p. 276). However, consider the case where most policies 
(or most policies in the final cycle) have a low projection error, but there are a few policies 
that cannot be approximated well using the projection operation, so that they have a large 
one-step projection error. A worst-case bound would be very loose, because it would be 
dictated by the error of the most difficult policy to approximate. On the other hand, using 
our discounted accumulated error formulation, errors introduced by policies that are hard 
to approximate decay very rapidly. Thus, the error bound represents an "average" case 
analysis: a decaying average of the projection errors for policies encountered at the succes- 
sive iterations of the algorithm. As in the convergent case, this bound can be computed 
easily as part of the policy iteration procedure when max-norm projection is used. 

The practical benefit of a posteriori bounds is that they can give meaningful feedback on 
the impact of the choice of the value function approximation architecture. While we are not 
explicitly addressing the difficult and general problem of feature selection in this paper, our 
error bounds motivate algorithms that aim to minimize the error given an approximation 
architecture and provide feedback that could be useful in future efforts to automatically 
discover or improve approximation architectures. 

3.3 Approximate Linear Programming 

3.3.1 The Exact Algorithm 

Linear programming provides an alternative method for solving MDPs. It formulates the 
problem of finding a value function as a linear program (LP). Here the LP variables are 
Vi, . . . , Vjvj where Vi represents V(xj): the value of starting at the ith state of the system. 
The LP is given by: 



where the state relevance weights a are positive. Note that, in this exact case, the solution 
obtained is the same for any positive weight vector. It is interesting to note that steps of 
the simplex algorithm correspond to policy changes at single states, while steps of policy 
iteration can involve policy changes at multiple states. In practice, policy iteration tends 
to be faster than the linear programming approach (Puterman, 1994). 

3.3.2 Approximate Linear Program 

The approximate formulation for the LP approach, first proposed by Schweitzer and Sei- 
dmann (1985), restricts the space of allowable value functions to the linear space spanned 
by our basis functions. In this approximate formulation, the variables are wi, . . . ,Wk- the 
weights for our basis functions. The LP is given by: 



Variables: 
Minimize: 
Subject to 



V 1: ...,V N ; 

Ex, «( x V ; 

Vi > [R(xi, a) + 7 J2j P( Xj | Xi, a)Vj] Vx.; e X, a € A, 



(7) 
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Variables: w\ , . . . , ; 
Minimize: J2x a ( x -)^2i w i ^«( x ) > 

Subject to: J2i w i ^«( x ) > [K(x,a) +7Ex'-^( x ' I x J a )J2i w i hi(x')] Vx G X,Va € A. 

(8) 

In other words, this formulation takes the LP in (7) and substitutes the explicit state 
value function by a linear value function representation J2% w i hi(x), or, in our more compact 
notation, V is replaced by Hw. This linear program is guaranteed to be feasible if a constant 
function — a function with the same constant value for all states — is included in the set 
of basis functions. 

In this approximate linear programming formulation, the choice of state relevance weights, 
a, becomes important. Intuitively, not all constraints in this LP are binding; that is, the 
constraints are tighter for some states than for others. For each state x, the relevance 
weight a(x) indicates the relative importance of a tight constraint. Therefore, unlike the 
exact case, the solution obtained may differ for different choices of the positive weight vector 
a. Furthermore, there is, in general, no guarantee as to the quality of the greedy policy 
generated from the approximation Hw. However, the recent work of de Farias and Van 
Roy (2001a) provides some analysis of the error relative to that of the best possible approx- 
imation in the subspace, and some guidance as to selecting a so as to improve the quality 
of the approximation. In particular, their analysis shows that this LP provides the best 
approximation Hw* of the optimal value function V* in a weighted C\ sense subject to the 
constraint that Hw* > T*Hw*, where the weights in the L\ norm are the state relevance 
weights a. 

The transformation from an exact to an approximate problem formulation has the ef- 
fect of reducing the number of free variables in the LP to k (one for each basis function 
coefficient), but the number of constraints remains N x \A\. In our SysAdmin problem, for 
example, the number of constraints in the LP in (8) is (m + 1) • 2 m , where m is the number 
of machines in the network. Thus, the process of generating the constraints and solving the 
LP still seems unmanageable for more than a few machines. In the next section, we discuss 
how we can use the structure of a factored MDP to provide for a compact representation 
and an efficient solution to this LP. 

4. Factored Value Functions 

The linear value function approach, and the algorithms described in Section 3, apply to any 
choice of basis functions. In the context of factored MDPs, Koller and Parr (1999) suggest 
a particular type of basis function, that is particularly compatible with the structure of a 
factored MDP. They suggest that, although the value function is typically not structured, 
there are many cases where it might be "close" to structured. That is, it might be well- 
approximated using a linear combination of functions each of which refers only to a small 
number of variables. More precisely, we define: 

Definition 4.1 A factored (linear) value function is a linear function over the basis set 
hi,. . . ,hk, where the scope of each hi is restricted to some subset of variables Cj. □ 

Value functions of this type have a long history in the area of multi-attribute utility the- 
ory (Keeney &; Raiffa, 1976). In our example, we might have a basis function hi for each 
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machine, indicating whether it is working or not. Each basis function has scope restricted 
to Xj. These are represented as diamonds in the next time step in Figure 2(b). 

Factored value functions provide the key to performing efficient computations over the 
exponential-sized state spaces we have in factored MDPs. The main insight is that re- 
stricted scope functions (including our basis functions) allow for certain basic operations to 
be implemented very efficiently. In the remainder of this section, we show how structure in 
factored MDPs can be exploited to perform two crucial operations very efficiently: one-step 
lookahead (backprojection), and the representation of exponentially many constraints in 
the LPs. Then, we use these basic building blocks to formulate very efficient approxima- 
tion algorithms for factored MDPs, each presented in its own self-contained section: the 
approximate linear programming for factored MDPs in Section 5, and approximate policy 
iteration with max-norm projection in Section 6. 

4.1 One-step Lookahead 

A key step in all of our algorithms is the computation of the one-step lookahead value of 
some action a. This is necessary, for example, when computing the greedy policy as in 
Equation (1). Let's consider the computation of a Q function, Q a (x), which represents the 
expected value the agent obtains after taking action a at the current time step and receiving 
a long-term value V thereafter. This Q function can be computed by: 

Q (x) = i?(x, a) + 7 E P ( x ' I x ' a M x )- ( 9 ) 

x' 

That is, Q a (x) is given by the current reward plus the discounted expected future value. 
Using this notation, we can express the greedy policy as: Greedy(V)(x.) = max a Q a (x). 

Recall that we are estimating the long-term value of our policy using a set of basis 
functions: V(x) = J2i w i ^j( x ). Thus, we can rewrite Equation (9) as: 

Q (x) = i?(x, a) + 7 E P ( x ' ! x ' °) E w i ^( x )' ( 10 ) 
x' i 

The size of the state space is exponential, so that computing the expectation P(x' | 
x, a)Y^i w i hi(x) seems infeasible. Fortunately, as discussed by Koller and Parr (1999), 
this expectation operation, or backprojection, can be performed efficiently if the transition 
model and the value function are both factored appropriately. The linearity of the value 
function permits a linear decomposition, where each summand in the expectation can be 
viewed as an independent value function and updated in a manner similar to the value 
iteration procedure used by Boutilier et al. (2000). We now recap the construction briefly, 
by first defining: 

G a (x) = E P{* I x, a) E Wi hi(x!) = 5><E P ( x ' I x > fl )^( x ')- 

x' i i x' 

Thus, we can compute the expectation of each basis function separately: 

ft tt (x)=E P (x'|x,a)/ lj (x / ), 

x' 



413 



GUESTRIN, KOLLER, PARR & VENKATARAMAN 



and then weight them by wi to obtain the total expectation G a (x) = Y^i w i <?f(x). The 
intermediate function gf is called the backprojection of the basis function hi through the 
transition model P a , which we denote by gf = P a hi. Note that, in factored MDPs, the 
transition model P a is factored (represented as a DBN) and the basis functions hi have 
scope restricted to a small set of variables. These two important properties allow us to 
compute the backprojections very efficiently. 

We now show how some restricted scope function h (such as our basis functions) 
can be backprojected through some transition model P T represented as a DBN r. Here 
h has scope restricted to Y; our goal is to compute g = P T h. We define the back- 
projected scope of Y through r as the set of parents of Y' in the transition graph G T ; 
rV(Y') = \J Y i (zy Pctrents T (Yl) . If intra-time slice arcs are included, so that Parents T (X' i ) G 
{Xi, . . . , X n , X[, . . . , X' a }, then the only change in our algorithm is in the definition of back- 
projected scope of Y through r. The definition now includes not only direct parents of Y', 
but also all variables in {X±, . . . , X n } that are ancestors of Y'\ 

r r (Y') = {Xj | there exist a directed path from Xj to any X\ G Y'}. 

Thus, the backprojected scope may become larger, but the functions are still factored. 

We can now show that, if h has scope restricted to Y, then its backprojection g has 
scope restricted to the parents of Y', i.e., r r (Y'). Furthermore, each backprojection can 
be computed by only enumerating settings of variables in r r (Y'), rather than settings of 
all variables X: 

<?(x) = (P T /»)(x); 

= £P T (x'|x)Mx'); 

x' 

= ^(x'lx^y'); 

x' 

= ^P T (y'|x)My') £ Pr(u ; |x); 
y' u'e(x'-y') 

= £p r (y'|z)My'); 

y' 

= g(z); 

where z is the value of r r (Y') in x and the term X)u'e(x'-y') Pr( u ' | x) = 1 as it is the 
sum of a probability distribution over a complete domain. Therefore, we see that (P T h) is a 
function whose scope is restricted to r T (Y'). Note that the cost of the computation depends 
linearly on |Dom(r r (Y'))|, which depends on Y (the scope of h) and on the complexity of 
the process dynamics. This backprojection procedure is summarized in Figure 3. 

Returning to our example, consider a basis function hi that is an indicator of variable JQ: 
it takes value 1 if the ith machine is working and otherwise. Each hi has scope restricted to 
X-, thus, its backprojection gi has scope restricted to Parents r {X' i ): T T {X' i ) = 

4.2 Representing Exponentially Many Constraints 

As seen in Section 3, both our approximation algorithms require the solution of linear pro- 
grams: the LP in (5) for approximate policy iteration, and the LP in (8) for the approximate 
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Backproj a (h) WHERE BASIS FUNCTION h HAS SCOPE C. 

Define the scope of the backprojection: T a (C) = \J X '.eG'Par ents a (X' i ). 
For each ASSIGNMENT y g r a (C): 

Return g a . 

Figure 3: Backprojection of basis function h. 

linear programming algorithm. These LPs have some common characteristics: they have 
a small number of free variables (for k basis functions there are k + 1 free variables in ap- 
proximate policy iteration and k in approximate linear programming), but the number of 
constraints is still exponential in the number of state variables. However, in factored MDPs, 
these LP constraints have another very useful property: the functionals in the constraints 
have restricted scope. This key observation allows us to represent these constraints very 
compactly. 

First, observe that the constraints in the linear programs are all of the form: 

0>X> Ci(x)-6(x),Vx, (11) 

i 

where only <\> and w±, . . . , Wk are free variables in the LP and x ranges over all states. This 
general form represents both the type of constraint in the max-norm projection LP in (5) 
and the approximate linear programming formulation in (8). 3 

The first insight in our construction is that we can replace the entire set of constraints 
in Equation (11) by one equivalent non-linear constraint: 

4> > max Wj Cj(x) — 6(x). (12) 

i 

The second insight is that this new non-linear constraint can be implemented by a set of 
linear constraints using a construction that follows the structure of variable elimination in 
cost networks. This insight allows us to exploit structure in factored MDPs to represent 
this constraint compactly. 

We tackle the problem of representing the constraint in Equation (12) in two steps: 
first, computing the maximum assignment for a fixed set of weights; then, representing the 
non- linear constraint by small set of linear constraints, using a construction we call the 
factored LP. 

4.2.1 Maximizing Over the State Space 

The key computation in our algorithms is to represent a non-linear constraint of the form 
in Equation (12) efficiently by a small set of linear constraints. Before presenting this con- 
struction, let's first consider a simpler problem: Given some fixed weights Wi, we would 
like to compute the maximization: cf>* = max x J2i w i c «( x ) — K x )> that i s > the state x, such 

3. The complementary constraints in (5), <f> > 6(x) — ^\ Wi Cj(x), can be formulated using an analogous 
construction to the one we present in this section by changing the sign of Cj(x) and fo(x). The approximate 
linear programming constraints of (8) can also be formulated in this form, as we show in Section 5. 
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that the difference between ^jtOj q( x ) an d fe(x) is maximal. However, we cannot explic- 
itly enumerate the exponential number of states and compute the difference. Fortunately, 
structure in factored MDPs allows us to compute this maximum efficiently. 

In the case of factored MDPs, our state space is a set of vectors x which are assign- 
ments to the state variables X = {X\, . . . , X n }. We can view both Cw and b as functions 
of these state variables, and hence also their difference. Thus, we can define a function 
F w (Xl, . . . , X n ) such that F w (x) = J2i w i q( x ) — K x )- Note that we have executed a 
representation shift; we are viewing F w as a function of the variables X, which is pa- 
rameterized by w. Recall that the size of the state space is exponential in the number 
of variables. Hence, our goal in this section is to compute max x i ?w (x) without explicitly 
considering each of the exponentially many states. The solution is to use the fact that F w 
has a factored representation. More precisely, Cw has the form J2i w i c «(Zi), where Zj is 
a subset of X. For example, we might have ci(X\,X 2 ) which takes value 1 in states where 
X\ = true and X2 = false and otherwise. Similarly, the vector b in our case is also a sum 
of restricted scope functions. Thus, we can express F w as a sum J2j /^(Zj), where fj' may 
or may not depend on w. In the future, we sometimes drop the superscript w when it is 
clear from context. 

Using our more compact notation, our goal here is simply to compute max x ^ Wi Cj(x) — 
6(x) = max x F w (x), that is, to find the state x over which F w is maximized. Recall that 
F w = J2J=i fj{Zj)- We can maximize such a function, F w , without enumerating every state 
using non-serial dynamic programming (Bertele & Brioschi, 1972). The idea is virtually 
identical to variable elimination in a Bayesian network. We review this construction here, 
as it is a central component in our solution LP. 

Our goal is to compute 



The main idea is that, rather than summing all functions and then doing the maximization, 
we maximize over variables one at a time. When maximizing over x;, only summands 
involving x\ participate in the maximization. 

Example 4.2 Assume 

F = fi{xi,x 2 ) + f2(x 1 ,x 3 ) + / 3 (x 2 ,x 4 ) + / 4 (x 3 ,x 4 ). 
We therefore wish to compute: 

max fi(x 1 ,x 2 ) + f 2 (xi,x 3 ) + h{x 2 ,Xi) + / 4 (x 3 ,x 4 ). 

We can first compute the maximum over x 4 ; the functions f\ and f 2 are irrelevant, so we 
can push them out. We get 



The result of the internal maximization depends on the values ofx 2 ,x 3 ; thus, we can intro- 
duce a new function e\{X 2 ,X 3 ) whose value at the point x 2 ,x 3 is the value of the internal 
max expression. Our problem now reduces to computing 



max 




.1 



max f\{xi,x 2 ) + f 2 (xi,x 3 ) + max[/ 3 (x 2 , x 4 ) + / 4 (x 3 ,x 4 )]. 

1,X2,X3 X4 



max /i(xi,x 2 ) + /2(xi,x 3 ) + ei(x 2 ,x 3 ) 
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VariableElimination (JF, O) 




/ /T — {/i, . . . , fm} is the set of functions to be maximized; 




//O stores the elimination order. 




For i = 1 TO NUMBER OF VARIABLES: 




//Select the next variable to be eliminated. 




Let I = O(i) ; 




//Select the relevant functions. 




Let ei, . . . , ez, be the functions in T whose scope contains Xi. 


//Maximize over current variable X\. 




Define A new function e = max X! e j \ 


NOTE THAT Scope[e] = 


ujj^Scopefo] - {X t }. 




//Update set of functions. 




Update THE SET OF FUNCTIONS T = T U {e} \ {ei, . 


■ ,e L }- 


//Now, all functions have empty scope and their sum is the maximum value of fj + ■ ■ ■ + f m . 


Return the maximum value J^e-er 6 *- 





Figure 4: Variable elimination procedure for computing the maximum value /i H h /, 

where each /; is a restricted scope function. 



having one fewer variable. Next, we eliminate another variable, say X3, with the resulting 
expression reducing to: 

max/i(xi,x 2 ) + e 2 (xi,x 2 ), 

X!,X 2 

where e 2 (xi,x 2 ) = max[/ 2 (xi, x 3 ) + ei(x 2 , x 3 )]. 

xz 

Finally, we define 

e 3 = max/i(xi,x 2 ) + e 2 (xi,x 2 ). 

X1,X2 

The result at this point is a number, which is the desired maximum over x±, . . . ,x±. While 
the naive approach of enumerating all states requires 63 arithmetic operations if all variables 
are binary, using variable elimination we only need to perform 23 operations. □ 

The general variable elimination algorithm is described in Figure 4. The inputs to 
the algorithm are the functions to be maximized J- = {/1, • • • , / m } and an elimination 
ordering O on the variables, where 0(i) returns the ith variable to be eliminated. As in 
the example above, for each variable X\ to be eliminated, we select the relevant functions 
ei, . . . , eL, those whose scope contains X[. These functions are removed from the set T and 
we introduce a new function e = max X; J2j=i e j- At this point, the scope of the functions in 
T no longer depends on Xi, that is, X\ has been 'eliminated'. This procedure is repeated 
until all variables have been eliminated. The remaining functions in T thus have empty 
scope. The desired maximum is therefore given by the sum of these remaining functions. 

The computational cost of this algorithm is linear in the number of new "function 
values" introduced in the elimination process. More precisely, consider the computation of 
a new function e whose scope is Z. To compute this function, we need to compute |Dom[Z]| 
different values. The cost of the algorithm is linear in the overall number of these values, 
introduced throughout the execution. As shown by Dechter (1999), this cost is exponential 
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in the induced width of the cost network, the undirected graph defined over the variables 
X±, . . . , X n , with an edge between Xi and X m if they appear together in one of the original 
functions fj. The complexity of this algorithm is, of course, dependent on the variable 
elimination order and the problem structure. Computing the optimal elimination order 
is an NP-hard problem (Arnborg, Corneil, & Proskurowski, 1987) and elimination orders 
yielding low induced tree width do not exist for some problems. These issues have been 
confronted successfully for a large variety of practical problems in the Bayesian network 
community, which has benefited from a large variety of good heuristics which have been 
developed for the variable elimination ordering problem (Bertele &, Brioschi, 1972; Kjaerulff, 
1990; Reed, 1992; Becker & Geiger, 2001). 

4.2.2 Factored LP 

In this section, we present the centerpiece of our planning algorithms: a new, general 
approach for compactly representing exponentially large sets of LP constraints in problems 
with factored structure — those where the functions in the constraints can be decomposed 
as the sum of restricted scope functions. Consider our original problem of representing 
the non-linear constraint in Equation (12) compactly. Recall that we wish to represent 
the non-linear constraint 4> > rnax x J]j^j q(x) — 6(x), or equivalently, <p > max x F w (x), 
without generating one constraint for each state as in Equation (11). The new, key insight 
is that this non-linear constraint can be implemented using a construction that follows the 
structure of variable elimination in cost networks. 

Consider any function e used within T (including the original /«'s), and let Z be its scope. 
For any assignment z to Z, we introduce variable u%, whose value represents e z , into the 
linear program. For the initial functions ff, we include the constraint that Ux = /^(z). As 
fy" is linear in w, this constraint is linear in the LP variables. Now, consider a new function 
e introduced into T by eliminating a variable X\. Let e±, . . . , be the functions extracted 
from T, and let Z be the scope of the resulting e. We introduce a set of constraints: 

L 

Let e n be the last function generated in the elimination, and recall that its scope is empty. 
Hence, we have only a single variable u e ". We introduce the additional constraint (f> > u e ". 

The complete algorithm, presented in Figure 5, is divided into three parts: First, we 
generate equality constraints for functions that depend on the weights Wi (basis functions). 
In the second part, we add the equality constraints for functions that do not depend on the 
weights (target functions). These equality constraints let us abstract away the differences 
between these two types of functions and manage them in a unified fashion in the third 
part of the algorithm. This third part follows a procedure similar to variable elimination 
described in Figure 4. However, unlike standard variable elimination where we would in- 
troduce a new function e, such that e = max Xl 2jf=i e ji i n our factored LP procedure we 
introduce new LP variables u%. To enforce the definition of e as the maximum over X\ of 
J2j=i e ji we introduce the new LP constraints in Equation (13). 

Example 4.3 To understand this construction, consider our simple example above, and 
assume we want to express the fact that 4> > max x F w (x) . We first introduce a set of 
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FactoredLP (C, h,0) 

// C = {ci, . . . ,Ck} is the set of basis functions. 
// b = {61, . . . , bm} is the set of target functions. 
//O stores the elimination order. 

//Return a (polynomial) set of constraints Q, equivalent to <fi > ^\ w>iCj(x) + ^ &j(x),Vx . 
/ /Data structure for the constraints in factored LP. 
Let n = {} . 

//Data structure for the intermediate functions generated in variable elimination. 
Let T = {} . 

//Generate equality constraint to abstract away basis functions. 
For EACH a e C: 

Let Z = Scope [a]. 

For EACH ASSIGNMENT Z e Z, CREATE A NEW LP VARIABLE u[* AND ADD A 
CONSTRAINT TO 0: 

u£ = WiCi{z). 

Store NEW FUNCTION fi TO USE IN VARIABLE ELIMINATION STEP: T = T \J {fi}. 
//Generate equality constraint to abstract away target functions. 
For EACH bj eb: 

Let Z = Scope[6j]. 

For EACH ASSIGNMENT Z G Z, CREATE A NEW LP VARIABLE wf 3 AND ADD A 
CONSTRAINT TO 0: 

u{ j = bj(z). 

Store NEW FUNCTION fj TO USE IN VARIABLE ELIMINATION STEP: T = J"U {fj}. 
//Now, T contains all of the functions involved in the LP, our constraints become: <f> > 

ej(x),Vx , which we represent compactly using a variable elimination procedure. 
For i — 1 TO NUMBER OF VARIABLES: 

//Select the next variable to be eliminated. 

Let I = 0(i) ; 

//Select the relevant functions. 

Let e\,...,6L BE THE FUNCTIONS IN T WHOSE SCOPE CONTAINS Xi, AND LET 

Zj = Scope[ej]. 

//Introduce linear constraints for the maximum over current variable Xi. 

Define A new function e with scope Z = U^ =1 Zj - {X{\ to represent 

max ii 2^j=i e j ■ 

Add constraints to £1 TO enforce maximum: for each assignment z e Z: 

L 

//Update set of functions. 

Update the set of functions T = T U {e} \ {ei, . . . , e L }. 
//Now, all variables have been eliminated and all functions have empty scope. 
Add LAST CONSTRAINT to fi: 

Return Vl. 

Figure 5: Factored LP algorithm for the compact representation of the exponential set of 
constraints 4> >J2i ^iCj(x) + J2j &j(x), Vx. 
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variables u x \ X2 for every instantiation of values X\,x<i to the variables X\,X<i- Thus, if 
X\ and X2 are both binary, we have four such variables. We then introduce a constraint 
defining the value of u^ x \ X2 appropriately. For example, for our f\ above, we have u{\ = 
and u^ = w\ . We have similar variables and constraints for each fj and each value z in 
Zj. Note that each of the constraints is a simple equality constraint involving numerical 
constants and perhaps the weight variables w. 

Next, we introduce variables for each of the intermediate expressions generated by vari- 
able elimination. For example, when eliminating X4, we introduce a set of LP variables 
u x2,x 3 ' f or eac ^ °f the m > we have a set of constraints 

?/ ei > 7/^ 3 -I- 7/^ 4 

X2,X3 — X2,X4 ' ^X3,X4 

one for each value X4 of X4. We have a similar set of constraint for u%\ X2 in terms of 
u l\ x 3 an d u x\ X3 ■ Note that each constraint is a simple linear inequality. □ 

We can now prove that our factored LP construction represents the same constraint as 
non- linear constraint in Equation (12): 

Theorem 4.4 The constraints generated by the factored LP construction are equivalent to 
the non-linear constraint in Equation (12). That is, an assignment to (0, w) satisfies the 
factored LP constraints if and only if it satisfies the constraint in Equation (12). 

Proof: See Appendix A. 3. □ 

Returning to our original formulation, we have that Y^j fj' is C w — b in the original 
set of constraints. Hence our new set of constraints is equivalent to the original set: <fi > 
m.ax. x J2i w i c i( x ) — K x ) m Equation (12), which in turn is equivalent to the exponential 
set of constraints (f) > J2iWi q(x) — 6(x), Vx in Equation (11). Thus, we can represent this 
exponential set of constraints by a new set of constraints and LP variables. The size of 
this new set, as in variable elimination, is exponential only in the induced width of the cost 
network, rather than in the total number of variables. 

In this section, we presented a new, general approach for compactly representing expo- 
nentially-large sets of LP constraints in problems with factored structure. In the remainder 
of this paper, we exploit this construction to design efficient planning algorithms for factored 
MDPs. 

4.2.3 Factored Max-norm Projection 

We can now use our procedure for representing the exponential number of constraints in 
Equation (11) compactly to compute efficient max-norm projections, as in Equation (4): 

w* € argmin IICw — bll . 

The max-norm projection is computed by the linear program in (5). There are two sets 
of constraints in this LP: (p > J2 l j=i c ij w j ~ and (p > b{ — Y^^=i c ij w ji^^- Each of 

these sets is an instance of the constraints in Equation (11), which we have just addressed 
in the previous section. Thus, if each of the k basis functions in C is a restricted scope 
function and the target function b is the sum of restricted scope functions, then we can 
use our factored LP technique to represent the constraints in the max-norm projection LP 
compactly. The correctness of our algorithm is a corollary of Theorem 4.4: 
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Corollary 4.5 The solution ((f)* , w*) of a linear program that minimizes <fi subject to the 
constraints in FactoredLP (C, — h,0) and FactoredLP^— C , h,0), for any elimination 
order O satisfies: 

w* G argmin IICw — bll , and 6* = min IICw — bll . □ 

The original max- norm projection LP had k + 1 variables and two constraints for each 
state x; thus, the number of constraints is exponential in the number of state variables. 
On the other hand, our new factored max-norm projection LP has more variables, but 
exponentially fewer constraints. The number of variables and constraints in the new factored 
LP is exponential only in the number of state variables in the largest factor in the cost 
network, rather than exponential in the total number of state variables. As we show in 
Section 9, this exponential gain allows us to compute max-norm projections efficiently when 
solving very large factored MDPs. 

5. Approximate Linear Programming 

We begin with the simplest of our approximate MDP solution algorithms, based on the 
approximate linear programming formulation in Section 3.3. Using the basic operations 
described in Section 4, we can formulate an algorithm that is both simple and efficient. 

5.1 The Algorithm 

As discussed in Section 3.3, approximate linear program formulation is based on the linear 
programming approach to solving MDPs presented in Section 3.3. However, in this ap- 
proximate version, we restrict the space of value functions to the linear space defined by 
our basis functions. More precisely, in this approximate LP formulation, the variables are 
w\, . . . ,Wk — the weights for our basis functions. The LP is given by: 

Variables: w\, . . . ,Wk ; 
Minimize: J2x a ( x ) Si w i ^i( x ) j 

Subject to: J2i m hi (x) > [i?(x, a) + 7 £ x , P(x' | x, a) J2i ™i h i ( x 01 Vx G X, Va G A. 

(14) 

In other words, this formulation takes the LP in (7) and substitutes the explicit state value 
function with a linear value function representation J2i w i hi (x) . This transformation from 
an exact to an approximate problem formulation has the effect of reducing the number 
of free variables in the LP to k (one for each basis function coefficient), but the number 
of constraints remains |X| x \A\. In our Sys Admin problem, for example, the number of 
constraints in the LP in (14) is (m + 1) ■ 2 m , where m is the number of machines in the 
network. However, using our algorithm for representing exponentially large constraint sets 
compactly we are able to compute the solution to this approximate linear programming 
algorithm in closed form with an exponentially smaller LP, as in Section 4.2. 

First, consider the objective function J2x a ( :x -)J2i w i ^«( x ) 01 the LP (14). Naively 
representing this objective function requires a summation over a exponentially large state 
space. However, we can rewrite the objective and obtain a compact representation. We 
first reorder the terms: 
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FactoredALP (P, R, 7, H, O, a) 

//P is the factored transition model. 

//R is the set of factored reward functions. 

/ /j is the discount factor. 

//H is the set of basis functions H = {hi, . . . , h k }. 
//O stores the elimination order, 
//a are the state relevance weights. 

//Return the basis function weights w computed by approximate linear programming. 
//Cache the backprojections of the basis functions. 
For EACH BASIS FUNCTION hi E H; FOR EACH ACTION a: 

Let gf = Backproj a (hi). 
//Compute factored state relevance weights. 

For EACH BASIS FUNCTION hi, COMPUTE THE FACTORED STATE RELEVANCE WEIGHTS 

cti as in Equation (15) . 

//Generate approximate linear programming constraints 

Let = {}. 

For EACH ACTION a: 

Let n = n U FactoredLP({7. 9 5 1 - h u . . . , jg% - h k }, R a , O). 
//So far, our constraints guarantee that <f> > 7i(x,o) + 7^ x ' -P( x ' I x > a ) X/i Wi ' l »( x ') — 
u>; hi(x); to satisfy the approximate linear programming solution in (14) we must add 
a final constraint. 
Let = U {</> = 0}. 

//We can now obtain the solution weights by solving an LP. 

Let W BE THE SOLUTION OF THE LINEAR PROGRAM: MINIMIZE Y,i a i w ii SUBJECT TO 
THE CONSTRAINTS 

Return w. 



Figure 6: Factored approximate linear programming algorithm. 
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x i i x 

Now, consider the state relevance weights a(x) as a distribution over states, so that a(x) > 
and 5Z X o( x ) = 1- As in backprojections, we can now write: 

0Li = Xl a ( x ) ^( x ) = H a ( c ») ^( c «); ( 15 ) 

x c,ec, 

where a(cj) represents the marginal of the state relevance weights a over the domain 
Dom[Cj] of the basis function hi. For example, if we use uniform state relevance weights as 
in our experiments — a(x) = j%| — then the marginals become a(cj) = y^-y . Thus, we can 
rewrite the objective function as J2i w i a i-> where each basis weight aj is computed as shown 
in Equation (15). If the state relevance weights are represented by marginals, then the cost 
of computing each a* depends exponentially on the size of the scope of Cj only, rather than 
exponentially on the number of state variables. On the other hand, if the state relevance 
weights are represented by arbitrary distributions, we need to obtain the marginals over the 
Cj's, which may not be an efficient computation. Thus, greatest efficiency is achieved by 
using a compact representation, such as a Bayesian network, for the state relevance weights. 

Second, note that the right side of the constraints in the LP (14) correspond to the Q a 
functions: 

Q tt (x) = iT(x) + 7 ]T P(x' | x, a) ^ Wi hitf). 

x' i 

Using the efficient backprojection operation in factored MDPs described in Section 4.1 we 
can rewrite the Q a functions as: 

Q (x)=iF(x)+ 7 X>iS?(x); 

i 

where gf is the backprojection of basis function hi through the transition model P a . As we 
discussed, if hi has scope restricted to Cj, then gf is a restricted scope function of r a (C^). 

We can precompute the backprojections gf and the basis relevance weights aj. The 
approximate linear programming LP of (14) can be written as: 

Variables: w\ , . . . , Wk ', 

Minimize: J2i a i w i I (16) 
Subject to: Wi ^(x) > [i? a (x) + 7^ m c/f(x)] Vx G X,Va G A. 

Finally, we can rewrite this LP to use constraints of the same form as the one in Equa- 
tion (12): 

Variables: w\ , . . . , Wk ; 

Minimize: J2i a i w i j (17) 
Subject to: > max x {R a (x) + J2i ™i [7^( x ) ~ ^j( x )]} Va G A. 

We can now use our factored LP construction in Section 4.2 to represent these non-linear 
constraints compactly. Basically, there is one set of factored LP constraints for each action 
a. Specifically, we can write the non-linear constraint in the same form as those in Equa- 
tion (12) by expressing the functions C as: q(x) = hi (x) — 75" (x) . Each Cj(x) is a restricted 
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scope function; that is, if faj(x) has scope restricted to Cj, then gf (x) has scope restricted 
to r a (CQ, which means that q(x) has scope restricted to Cj U r a (CQ. Next, the target 
function b becomes the reward function R a (x) which, by assumption, is factored. Finally, 
in the constraint in Equation (12), <p is a free variable. On the other hand, in the LP in (17) 
the maximum in the right hand side must be less than zero. This final condition can be 
achieved by adding a constraint <p = 0. Thus, our algorithm generates a set of factored 
LP constraints, one for each action. The total number of constraints and variables in this 
new LP is linear in the number of actions \A\ and only exponential in the induced width 
of each cost network, rather than in the total number of variables. The complete factored 
approximate linear programming algorithm is outlined in Figure 6. 

5.2 An Example 

We now present a complete example of the operations required by the approximate LP algo- 
rithm to solve the factored MDP shown in Figure 2(a). Our presentation follows four steps: 
problem representation, basis function selection, backprojections and LP construction. 

Problem Representation: First, we must fully specify the factored MDP model for the 
problem. The structure of the DBN is shown in Figure 2(b). This structure is maintained 
for all action choices. Next, we must define the transition probabilities for each action. 
There are 5 actions in this problem: do nothing, or reboot one of the 4 machines in the 
network. The CPDs for these actions are shown in Figure 2(c). Finally, we must define the 
reward function. We decompose the global reward as the sum of 4 local reward functions, 
one for each machine, such that there is a reward if the machine is working. Specifically, 
Ri(Xi = true) = 1 and Ri{Xi = false) = 0, breaking symmetry by setting R^X^ = true) = 
2. We use a discount factor of 7 = 0.9. 

Basis Function Selection: In this simple example, we use five simple basis functions. 
First, we include the constant function ho = 1. Next, we add indicators for each machine 
which take value 1 if the machine is working: hi(Xi = true) = 1 and hi(Xi = false) = 0. 

Backprojections: The first algorithmic step is computing the backprojection of the 
basis functions, as defined in Section 4.1. The backprojection of the constant basis is 
simple: 

g a Q = ]TPa(x' |x)ho ; 

x' 

= ]TP a (x'|x)l; 

x' 

= 1 . 

Next, we must backproject our indicator basis functions hi: 
gt = £P a (x'|x)^) ; 

x' 

X) Il P «( x j I Xj-i,Xj)hi(x'i) ; 
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x[ x'[X'-{X?}] j+i 

= ^ ] Pg{%j I Xi—\jXi)hi{Xi) ; 

= P a (A~- = £rue | Xj_i, x«) 1 + -Pa(A~- = /a/se | Xj_i, Xi) ; 
= P a {X[ = true | Xj_i,Xj) . 

Thus, gf is a restricted scope function of {Aj__i,Aj}. We can now use the CPDs in Fig- 
ure 2(c) to specify g®: 



^reboot = i 



9: 



reboot =fc i 



(Xi-i,Xi) 



(Xi-i,Xi) 





Xi = true 


Xi = false 


Xi-i = true 


1 


1 


Xi-i = false 


1 


1 






Xi = true 


Xi = false 


= true 


0.9 


0.09 


Xj_i = false 


0.5 


0.05 



LP Construction: To illustrate the factored LPs constructed by our algorithms, we 
define the constraints for the approximate linear programming approach presented above. 
First, we define the functions c" = jgf — hi, as shown in Equation (17). In our example, 
these functions are eg = 7 — 1 = —0.1 for the constant basis, and for the indicator bases: 



reboot = i / 



^reboot / i 







Xi = true 


Xi = false 


% {Xi-\,Xi) — 


Xi-i = true 


-0.1 


0.9 




= false 


-0.1 


0.9 



(Xi-\,Xi) 





Xi = true 


Xi = false 


Xi^\ = true 


-0.19 


0.081 


Xi-\ = false 


-0.55 


0.045 



Using this definition of cf , the approximate linear programming constraints are given by: 

> max^ Ri + w j c j > Va • ( 18 ) 



We present the LP construction for one of the 5 actions: reboot = 1. Analogous constructions 
can be made for the other actions. 

In the first set of constraints, we abstract away the difference between rewards and basis 
functions by introducing LP variables u and equality constraints. We begin with the reward 
functions: 



,R3 _ 



u 



a-3 



1,^=0; 
1,^=0; 



u 
u 



R2 = . 

t = 2 , u R * 



i , 4 2 = o 



X'4 



. 



We now represent the equality constraints for the c" functions for the reboot = 1 action. Note 
that the appropriate basis function weight from Equation (18) appears in these constraints: 
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u 



u c o = -o.l w ; 
-0.1 wi 



ci 



U 



C2 



— 0.19 W2 , U 



U- 1 
"'Xi ,X4 

C_2 

X1,X2 



U x\,xz ~ 0.19 ^3 i u x 2 ,X3 — 



0.9 w\ 

-0.55 w 2 , u xiyl , 
/ 3 - 

A X2,X3 



= —0.1 u>i , u^\,s d = 0-9 w^i ; 

,., , , - 0.045 w 2 ; 

X2,X3 



'X\,X4 "'l > w Xl,X4 

■ C2 - 0.081 U> 2 , % 2 " 
-0.55 w 3 , u% 5 , = 0.081 u> 3 , u| 3 „ 5 = 0.045 u>3 ; 



<%,X4 = -°- 19 «>4 , t# 3)X4 = -0.55 u/ 4 , u? 3iS4 = 0.081 w 4 , ug 4 3 , S4 = 0.045 w 4 • 

Using these new LP variables, our LP constraint from Equation (18) for the reboot = 1 action 
becomes: 



> 



max 

Xi,X2,X3,X4 



i=i 



+ u c ° + ^ ■ 



We are now ready for the variable elimination process. We illustrate the elimination of 
variable X4: 



> max V u x % + u°° + V u c Y x + 

~ X U X 2 ,X 3 ^ A ' ^ A J-i' A J 



i=i 



We can represent the term maxx 4 



J=2 

4: 



max 

X4 



u x\ + U Xi,X 4 + U X 3 .\ 1 



,C4 



-Xi,X 4 



c 4 

^3,^4 



by a set of linear constraints, 



one for each assignment of X\ and X3 , using the new LP variables u Xi Xi to represent this 
maximum: 



ei 

Xl,X 3 


> 


U X4 


+ n Cl 

X \,X4 


+ u° 4 

1 U 'X3,X4 


ei 

X1,X3 


> 




— 1— n C l _ 

^ u 'Xl,X4 


+ u Ci - 

' U, X3,X4 


ei 

XI, X3 


> 


< 4 


+ u^ 1 

^ X \,X4 


+ u° 4 

1 U 'X3,X4 


ei 

Xl,X 3 


> 




+ <* 4 


+ u C4 - 

' U 'X3,X4 


ei 

XI, X3 


> 


< 4 


+ u Cl 

1 u 'Xl,X4 


+ u- 4 

' U 'X3,X4 


ei 

Xi, X3 


> 


4 4 


+ u ci - 

T U 'X\,X4 


+ U- 4 - 

' U, X 3 ,X4 


ei 

XI, X3 


> 


U X4 


+ u- 1 

T U 'X1,X4 


+ u- 4 


ei 

X1,X3 


> 


R4 
U X4 


+ U- 1 - 

' U, X\,X4 


+ u c - 4 - 

~ U X3,X4 



We have now eliminated variable X4 and our global non-linear constraint becomes: 

3 3 



> max V uy 1 + u c ° + V 

-Xux 2 ,x 3 ^ * ^ 



c j j ei 

U x j _ 1 ,x j + u x 1 ,x 3 ■ 



Next, we eliminate variable X3. The new LP constraints and variables have the form: 

u% X2 > u*l + + u XijX3 , V X U X 2 ,X S ; 

thus, removing X% from the global non- linear constraint: 



> max ]T u% + u c ° + + u5 1>Xa . 

' i=l 
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Figure 7: Number of constraints in the LP generated by the explicit state representation 
versus the factored LP construction for the solution of the ring problem with 
basis functions over single variables and approximate linear programming as the 
solution algorithm. 



We can now eliminate X2, generating the linear constraints: 

Now, our global non- linear constraint involves only X±: 

> max Uy 1 + u c ° + u% . 

As Xi is the last variable to be eliminated, the scope of the new LP variable is empty and 
the linear constraints are given by: 

n 64 > u x \ + u% , V X 1 . 

All of the state variables have now been eliminated, turning our global non-linear constraint 
into a simple linear constraint: 

> u C0 + u 64 , 

which completes the LP description for the approximate linear programming solution to 
the problem in Figure 2. 

In this small example with only four state variables, our factored LP technique generates 
a total of 89 equality constraints, 115 inequality constraints and 149 LP variables, while 
the explicit state representation in Equation (8) generates only 80 inequality constraints 
and 5 LP variables. However, as the problem size increases, the number of constraints and 
LP variables in our factored LP approach grow as 0(n 2 ), while the explicit state approach 
grows exponentially, at 0(n2 n ). This scaling effect is illustrated in Figure 7. 

6. Approximate Policy Iteration with Max-norm Projection 

The factored approximate linear programming approach described in the previous section 
is both elegant and easy to implement. However, we cannot, in general, provide strong 
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guarantees about the error it achieves. An alternative is to use the approximate policy 
iteration described in Section 3.2, which does offer certain bounds on the error. However, 
as we shall see, this algorithm is significantly more complicated, and requires that we place 
additional restrictions on the factored MDP. 

In particular, approximate policy iteration requires a representation of the policy at each 
iteration. In order to obtain a compact policy representation, we must make an additional 
assumption: each action only affects a small number of state variables. We first state this 
assumption formally. Then, we show how to obtain a compact representation of the greedy 
policy with respect to a factored value function, under this assumption. Finally, we describe 
our factored approximate policy iteration algorithm using max-norm projections. 

6.1 Default Action Model 

In Section 2.2, we presented the factored MDP model, where each action is associated with 
its own factored transition model represented as a DBN and with its own factored reward 
function. However, different actions often have very similar transition dynamics, only dif- 
fering in their effect on some small set of variables. In particular, in many cases a variable 
has a default evolution model, which only changes if an action affects it directly (Boutilier 
et al, 2000). 

This type of structure turns out to be useful for compactly representing policies, a prop- 
erty which is important in our approximate policy iteration algorithm. Thus, in this section 
of the paper, we restrict attention to factored MDPs that are defined using a default transi- 
tion model Td = (Gd, Pa) (Koller & Parr, 2000). For each action a, we define Effects[a] C X' 
to be the variables in the next state whose local probability model is different from Td, i.e., 
those variables X[ such that P a {X[ | Parents a {X' i )) ^ Pd{X[ \ ParentSd(X-)). 

Example 6.1 In our system administrator example, we have an action a« for rebooting 
each one of the machines, and a default action d for doing nothing. The transition model 
described above corresponds to the "do nothing" action, which is also the default transition 
model. The transition model for a« is different from d only in the transition model for the 
variable X[, which is now X[ = true with probability one, regardless of the status of the 
neighboring machines. Thus, in this example, Effects[ai] = X[. □ 

As in the transition dynamics, we can also define the notion of default reward model. In 
this case, there is a set of reward functions J2~i=i ^i(Uj) associated with the default action 
d. In addition, each action a can have a reward function i? a (U a ). Here, the extra reward of 
action a has scope restricted to Rewards[a] = XJf C {X\, . . . ,X n }. Thus, the total reward 
associated with action a is given by R a + J2i=i Pi- Note that R a can also be factored as a 
linear combination of smaller terms for an even more compact representation. 

We can now build on this additional assumption to define the complete algorithm. 
Recall that the approximate policy iteration algorithm iterates through two steps: policy 
improvement and approximate value determination. We now discuss each of these steps. 

6.2 Computing Greedy Policies 

The policy improvement step computes the greedy policy relative to a value function y^ -1 ); 

vrM = Greedy{V (t -^). 
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Recall that our value function estimates have the linear form Hw. As we described in 
Section 4.1, the greedy policy for this type of value function is given by: 

Greedy(Hw) (x) = arg max Q a (x) , 

where each Q a can be represented by: Q a ( x ) = o) + J2i w i 9f ( x )- 

If we attempt to represent this policy naively, we are again faced with the problem 
of exponentially large state spaces. Fortunately, as shown by Koller and Parr (2000), the 
greedy policy relative to a factored value function has the form of a decision list. More 
precisely, the policy can be written in the form (ti, a±), (tz, 02), . . . , (t£, a^), where each tj 
is an assignment of values to some small subset Tj of variables, and each Oj is an action. 
The greedy action to take in state x is the action a,j corresponding to the first event tj in 
the list with which x is consistent. For completeness, we now review the construction of 
this decision-list policy. 

The critical assumption that allows us to represent the policy as a compact decision list 
is the default action assumption described in Section 6.1. Under this assumption, the Q a 
functions can be written as: 

r 

Q„(x) = i? a (x) + fliM + E w * ^(x), 

i=l i 

where R a has scope restricted to U a . The Q function for the default action d is just: 

Q d (x) = ELi^W + E i ^5f(x). 

We now have a set of linear Q-functions which implicitly describes a policy ir. It is 
not immediately obvious that these Q functions result in a compactly expressible policy. 
An important insight is that most of the components in the weighted combination are 
identical, so that gf is equal to gf for most i. Intuitively, a component gf corresponding 
to the backprojection of basis function /ij(Cj) is only different if the action a influences 
one of the variables in Cj. More formally, assume that Effects[a] n Cj = 0. In this case, 
all of the variables in Cj have the same transition model in r a and Thus, we have 
that gf (x) = <?f(x); in other words, the ith component of the Q a function is irrelevant 
when deciding whether action a is better than the default action d. We can define which 
components are actually relevant: let I a be the set of indices i such that Effects[a] flCj 7^ 0. 
These are the indices of those basis functions whose backprojection differs in P a and P^. 
In our example DBN of Figure 2, actions and basis functions involve single variables, so 
hi = i. 

Let us now consider the impact of taking action a over the default action d. We can 
define the impact — the difference in value — as: 

<5„(x) = Q (x) - Q d (x); 

= i? a (x) + 5>, [ 5 f(x)- 5 f(x)]. (19) 

iei a 

This analysis shows that <5 a (x) is a function whose scope is restricted to 

T a = u a u [u ieJo r a (c;)] . (20) 
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DECISIONLlSTPOLICY (Q a ) 

/ /Qa is the set of Q-functions, one for each action; 
//Return the decision list policy A. 
//Initialize decision list. 
Let A = {}. 

//Compute the bonus functions. 

For EACH ACTION a, OTHER THAN THE DEFAULT ACTION d\ 
Compute THE BONUS FOR TAKING ACTION a, 

(5 a (x) = Qa(x) - Qd(x); 

as in Equation (19). Note that S a has scope restricted to T a , as in 
Equation (20). 

//Add states with positive bonuses to the (unsorted) decision list. 
For EACH ASSIGNMENT t £ T a : 

If S a (t) > 0, ADD BRANCH TO DECISION LIST: 

A = Au{(t,aA(t)}}. 

//Add the default action to the (unsorted) decision list. 

Let A = Au{(0,d,O)}. 

//Sort decision list to obtain final policy. 

Sort THE DECISION LIST A IN DECREASING ORDER ON THE S ELEMENT OF (t,a,S). 

Return A. 



Figure 8: Method for computing the decision list policy A from the factored representation 
of the Q a functions. 



In our example DBN, T a2 = {X 1 ,X 2 }. 

Intuitively, we now have a situation where we have a "baseline" value function Qd(x) 
which defines a value for each state x. Each action a changes that baseline by adding or 
subtracting an amount from each state. The point is that this amount depends only on T a , 
so that it is the same for all states in which the variables in T a take the same values. 

We can now define the greedy policy relative to our Q functions. For each action a, define 
a set of conditionals (t, a, 5), where each t is some assignment of values to the variables T a , 
and S is 5 a (t). Now, sort the conditionals for all of the actions by order of decreasing 5: 

(ti, ai, Si), (t 2 , 02,62), ... , (tL,aL, 8l)- 

Consider our optimal action in a state x. We would like to get the largest possible "bonus" 
over the default value. If x is consistent with ti, we should clearly take action cti, as it 
gives us bonus 5\. If not, then we should try to get 62] thus, we should check if x is 
consistent with t 2 , and if so, take a 2 . Using this procedure, we can compute the decision- 
list policy associated with our linear estimate of the value function. The complete algorithm 
for computing the decision list policy is summarized in Figure 8. 

Note that the number of conditionals in the list is J2 a |Dom(T a )|; T a , in turn, depends 
on the set of basis function clusters that intersect with the effects of a. Thus, the size 
of the policy depends in a natural way on the interaction between the structure of our 
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process description and the structure of our basis functions. In problems where the actions 
modify a large number of variables, the policy representation could become unwieldy. The 
approximate linear programming approach in Section 5 is more appropriate in such cases, 
as it does not require an explicit representation of the policy. 

6.3 Value Determination 

In the approximate value determination step our algorithm computes: 

w (t) = argmin ||Hw - (R^t) + tP^Hw)^ . 

By rearranging the expression, we get: 

w (i) = arg min || (H - 7 P ffW H) w - R w(t) ^ . 

This equation is an instance of the optimization in Equation (4). If P^t) is factored, we can 
conclude that C = (H — jP n ( t )H) is also a matrix whose columns correspond to restricted- 
scope functions. More specifically: 

cj(x) = /ij(x) -7fff (0 (x), 

where gf t] is the backprojection of the basis function hi through the transition model P n w , 
as described in Section 4.1. The target b = R n (t) corresponds to the reward function, which 
for the moment is assumed to be factored. Thus, we can again apply our factored LP in 
Section 4.2.3 to estimate the value of the policy tt^\ 

Unfortunately, the transition model P^(t) is not factored, as a decision list representa- 
tion for the policy tt^ will, in general, induce a transition model P^(t) which cannot be 
represented by a compact DBN. Nonetheless, we can still generate a compact LP by ex- 
ploiting the decision list structure of the policy. The basic idea is to introduce cost networks 
corresponding to each branch in the decision list, ensuring, additionally, that only states 
consistent with this branch are considered in the cost network maximization. Specifically, 
we have a factored LP construction for each branch (tj,Oj). The ith cost network only 
considers a subset of the states that is consistent with the ith branch of the decision list. 
Let Si be the set of states x such that tj is the first event in the decision list for which x 
is consistent. That is, for each state x £ Si, x is consistent with tj, but it is not consistent 
with any tj with j < i. 

Recall that, as in Equation (11), our LP construction defines a set of constraints that 
imply that 4> > J2i w i c «( x ) — &( x ) f° r each state x. Instead, we have a separate set of 
constraints for the states in each subset Si. For each state in Si, we know that action a% is 
taken. Hence, we can apply our construction above using P ai — a transition model which is 
factored by assumption — in place of the non-factored P n (t) ■ Similarly, the reward function 
becomes R ai (x) + Y7i=i Pi( x ) f° r this subset of states. 

The only issue is to guarantee that the cost network constraints derived from this tran- 
sition model are applied only to states in Si. Specifically, we must guarantee that they are 
applied only to states consistent with tj, but not to states that are consistent with some 
tj for j < i. To guarantee the first condition, we simply instantiate the variables in Tj to 
take the values specified in tj. That is, our cost network now considers only the variables in 
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FactoredAPI (P, R, 7, H, O, e, t max ) 

//P is the factored transition model. 

//R is the set of factored reward functions. 

/ /j is the discount factor. 

//H is the set of basis functions H = {hi, . . . , h k }. 

//O stores the elimination order. 

/ /e Bellman error precision. 

//tmax maximum number of iterations. 

//Return the basis function weights w computed by approximate policy iteration. 
//Initialize weights 
Let w<°) = 0. 

//Cache the backprojections of the basis functions. 

For EACH BASIS FUNCTION hi £ H; FOR EACH ACTION a: 

Let gf = Backproj a (hi). 
//Main approximate policy iteration loop. 
Let t = 0. 
Repeat 

//Policy improvement part of the loop. 

/ / Compute decision list policy for iteration t weights. 
Let AW = DECISIONLlSTPOLICY(i? a + 7 Ei wf^f) • 
//Value determination part of the loop. 

/ /Initialize constraints for max-norm projection LP. 
Let fl+ = {} and fi- = {}. 
//Initialize indicators. 
Let 1= {}. 

//For every branch of the decision list policy, generate the relevant set of constraints, and 

update the indicators to constraint the state space for future branches. 
For EACH BRANCH (tj,Cbj} IN THE DECISION LIST POLICY A W : 

//Instantiate the variables in Tj to the assignment given in tj. 

Instantiate the set of functions {hi - jg^ , . . . ,h k - 75fe 3 } with the 

PARTIAL STATE ASSIGNMENT tj AND STORE IN C. 

Instantiate the target functions R a i with the partial state assign- 
ment tj and store in b. 

Instantiate the indicator functions 1 with the partial state as- 
signment tj and store in T . 

//Generate the factored LP constraints for the current decision list branch. 
Let fl+ = fl+ U FactoredLP(C, -b+1',0). 
Let ft" = ft" UFACTOREDLP(-C,b+J',0). 
//Update the indicator functions. 

Let Xj(x) = — OOl(x = tj) AND UPDATE THE INDICATORS I = I U Tj . 
//We can now obtain the new set of weights by solving an LP, which corresponds to the 
max-norm projection. 

Let w(* +1 ) BE THE SOLUTION OF THE LINEAR PROGRAM: MINIMIZE (f), SUBJECT 
TO THE CONSTRAINTS {ft+,ft~}. 

Let t = t + l. 

Until BellmanErr(HwW) < e OR t > t max or w^ 1 ) = w^. 
Return . 



Figure 9: Factored approximate policy iteration with max-norm projection algorithm. 
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{Xi, . . . , X n } — Tj, and computes the maximum only over the states consistent with Tj = tj. 
To guarantee the second condition, we ensure that we do not impose any constraints on 
states associated with previous decisions. This is achieved by adding indicators lj for each 
previous decision tj, with weight — oo. More specifically, Tj is a function that takes value 
— oo for states consistent with tj and zero for other all assignments of Tj. The constraints 
for the ith branch will be of the form: 

0>i2(x,a i ) + 2«;,(7(/i(x,a i )-/i(x)) + 5^-ool(x = t j ), Vx~[tj], (21) 

l j<i 

where x ~ [tj] defines the assignments of X consistent with tj. The introduction of these 
indicators causes the constraints associated with tj to be trivially satisfied by states in Sj 
for j < i. Note that each of these indicators is a restricted-scope function of Tj and can 
be handled in the same fashion as all other terms in the factored LP. Thus, for a decision 
list of size L, our factored LP contains constraints from 2L cost networks. The complete 
approximate policy iteration with max-norm projection algorithm is outlined in Figure 9. 

6.4 Comparisons 

It is instructive to compare our max-norm policy iteration algorithm to the /^-projection 
policy iteration algorithm of Roller and Parr (2000) in terms of computational costs per 
iteration and implementation complexity. Computing the Li projection requires (among 
other things) a series of dot product operations between basis functions and backprojected 
basis functions {hi»gj). These expressions are easy to compute if P n refers to the transition 
model of a particular action a. However, if the policy ir is represented as a decision list, as is 
the result of the policy improvement step, then this step becomes much more complicated. 
In particular, for every branch of the decision list, for every pair of basis functions i and j, 
and for each assignment to the variables in Scope[/ij] U Scope[(/j], it requires the solution of 
a counting problem which is jjP-complete in general. Although Roller and Parr show that 
this computation can be performed using a Bayesian network (BN) inference, the algorithm 
still requires a BN inference for each one of those assignments at each branch of the decision 
list. This makes the algorithm very difficult to implement efficiently in practice. 

The max-norm projection, on the other hand, relies on solving a linear program at every 
iteration. The size of the linear program depends on the cost networks generated. As we 
discuss, two cost networks are needed for each point in the decision list. The complexity 
of each of these cost networks is approximately the same as only one of the BN inferences 
in the counting problem for the £2 projection. Overall, for each branch in the decision 
list, we have a total of two of these "inferences," as opposed to one for each assignment of 
Scope[/ij] U Scope^™] for every pair of basis functions i and j. Thus, the max-norm policy 
iteration algorithm is substantially less complex computationally than the approach based 
on /^-projection. Furthermore, the use of linear programming allows us to rely on existing 
LP packages (such as CPLEX), which are very highly optimized. 

It is also interesting to compare the approximate policy iteration algorithm to the ap- 
proximate linear programming algorithm we presented in Section 5. In the approximate 
linear programming algorithm, we never need to compute the decision list policy. The 
policy is always represented implicitly by the Q a functions. Thus, this algorithm does not 
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require explicit computation or manipulation of the greedy policy. This difference has two 
important consequences: one computational and the other in terms of generality. 

First, not having to compute or consider the decision lists makes approximate linear 
programming faster and easier to implement. In this algorithm, we generate a single LP 
with one cost network for each action and never need to compute a decision list policy. On 
the other hand, in each iteration, approximate policy iteration needs to generate two LPs 
for every branch of the decision list of size L, which is usually significantly longer than \A\, 
with a total of 2L cost networks. In terms of representation, we do not require the policies 
to be compact; thus, we do not need to make the default action assumption. Therefore, the 
approximate linear programming algorithm can deal with a more general class of problems, 
where each action can have its own independent DBN transition model. On the other hand, 
as described in Section 3.2, approximate policy iteration has stronger guarantees in terms 
of error bounds. These differences will be highlighted further in our experimental results 
presented in Section 9. 



7. Computing Bounds on Policy Quality 

We have presented two algorithms for computing approximate solutions to factored MDPs. 
All these algorithms generate linear value functions which can be denoted by Hw, where w 
are the resulting basis function weights. In practice, the agent will define its behavior by 
acting according to the greedy policy n = Greedy(Hw). One issue that remains is how this 
policy 7? compares to the true optimal policy ir*; that is, how the actual value V~ of policy 
7T compares to V*. 

In Section 3, we showed some a priori bounds for the quality of the policy. Another 
possible procedure is to compute an a posteriori bound. That is, given our resulting weights 
w, we compute a bound on the loss of acting according to the greedy policy tt rather than 
the optimal policy. This can be achieved by using the Bellman error analysis of Williams 
and Baird (1993). 

The Bellman error is defined as BellmanErr(V) = ||T*V — V)^. Given the greedy 
policy 7r = GreedyiV), their analysis provides the bound: 

< 2 7 BellmanErr(V) 

II 7rlloo— 1 — 7 

Thus, we can use the Bellman error BellmanErr(Hw) to evaluate the quality of our resulting 
greedy policy. 

Note that computing the Bellman error involves a maximization over the state space. 
Thus, the complexity of this computation grows exponentially with the number of state 
variables. Koller and Parr (2000) suggested that structure in the factored MDP can be 
exploited to compute the Bellman error efficiently Here, we show how this error bound can 
be computed by a set of cost networks using a similar construction to the one in our max- 
norm projection algorithms. This technique can be used for any 7? that can be represented 
as a decision list and does not depend on the algorithm used to determine the policy. Thus, 
we can apply this technique to solutions determined approximate linear programming if the 
action descriptions permit a decision list representation of the policy. 

For some set of weights w, the Bellman error is given by: 
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FactoredBellmanErr (P, R, 7, H, O, w) 

/ /P is the factored transition model. 

//R is the set of factored reward functions. 

/ /y is the discount factor. 

//H is the set of basis functions H = {hi, . . . , hk}. 

/ /O stores the elimination order. 

//w are the weights for the linear value function. 

//Return the Bellman error for the value function Hw. 
/ / Cache the backprojections of the basis functions. 
For EACH BASIS FUNCTION h t G H ; FOR EACH ACTION a: 

Let gf = Backproj a (hi). 
//Compute decision list policy for value function Hw. 
Let A = DECISIONLlSTPOLICY(i? a + 7^ Wi9i)- 
/ /Initialize indicators. 
Let 1= {}. 

/ /Initialize Bellman error. 
Let e = 0. 

//For every branch of the decision list policy, generate the relevant cost networks, solve it with 
variable elimination, and update the indicators to constraint the state space for future branches. 
For EACH BRANCH (tj,dj) IN THE DECISION LIST POLICY A: 
/ /Instantiate the variables in Tj to the assignment given in tj . 

Instantiate the set of functions {wi(hi-jg^ j ), . . . ,Wk(hk~jgl j )} with the 

PARTIAL STATE ASSIGNMENT tj AND STORE IN C. 

Instantiate the target functions R a] with the partial state assignment 
tj and store in b. 

Instantiate the indicator functions 1 with the partial state assignment 

tj and store in T . 
//Use variable elimination to solve first cost network, and update Bellman error, if error 

for this branch is larger. 
Let e = max (e, VariableElimination(C - b + J', O)). 

//Use variable elimination to solve second cost network, and update Bellman error, if error 

for this branch is larger. 
Let e = max (e, VariableElimination(-C + b+l',0)). 
//Update the indicator functions. 

Let Ij(x) = — OOl(x = tj) AND UPDATE THE INDICATORS I = I U Tj . 

Return e. 



Figure 10: Algorithm for computing Bellman error for factored value function Hw. 
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BellmanErr(Hw) = ||T*Hw - Hw|| M ; 

max ( maXx ^ Wihi ^ ~ R ^ ~ 7 P ^ x ' I x ) w 3 h i ( x ') ' ^ 
V max x i2-(x) + 7 Ex' ^( x ' I x ) Ej u>jfy (*') - E» ^(x) y ' 

If the rewards i?~ and the transition model are factored appropriately, then we can 
compute each one of these two maximizations (max x ) using variable elimination in a cost 
network as described in Section 4.2.1. However, tx is a decision list policy and it does not 
induce a factored transition model. Fortunately, as in the approximate policy iteration 
algorithm in Section 6, we can exploit the structure in the decision list to perform such 
maximization efficiently. In particular, as in approximate policy iteration, we will generate 
two cost networks for each branch in the decision list. To guarantee that our maximization 
is performed only over states where this branch is relevant, we include the same type of 
indicator functions, which will force irrelevant states to have a value of —00, thus guaran- 
teeing that at each point of the decision list policy we obtain the corresponding state with 
the maximum error. The state with the overall largest Bellman error will be the maximum 
over the ones generated for each point the in the decision list policy. The complete factored 
algorithm for computing the Bellman error is outlined in Figure 10. 

One last interesting note concerns our approximate policy iteration algorithm with max- 
norm projection of Section 6. In all our experiments, this algorithm converged, so that 
= w(* +1 ) after some iterations. If such convergence occurs, then the objective function 
of the linear program in our last iteration is equal to the Bellman error of the final 

policy: 

Lemma 7.1 // approximate policy iteration with max-norm projection converges, so that 
= w(* +1 ) for some iteration t, then the max-norm projection error of the last 

iteration is equal to the Bellman error for the final value function estimate Hw = Hw^: 

BellmanErr(Hw) = <p {t+l) . 

Proof: See Appendix A. 4. □ 

Thus, we can bound the loss of acting according to the final policy 7r(' +1 ) by substituting 
into the Bellman error bound: 

Corollary 7.2 If approximate policy iteration with max-norm projection converges after 
t iterations to a final value function estimate Hw associated with a greedy policy it = 
Greedy(Hw), then the loss of acting according to tt instead of the optimal policy ir* is 
bounded by: 

y*-V~ < -JJL 

H K 7TlloO — I _j > 

where V~ is the actual value of the policy n. □ 

Therefore, when approximate policy iteration converges we can obtain a bound on the 
quality of the resulting policy without needing to compute the Bellman error explicitly. 
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8. Exploiting Context-specific Structure 

Thus far, we have presented a suite of algorithms which exploit additive structure in the 
reward and basis functions and sparse connectivity in the DBN representing the transition 
model. However, there exists another important type of structure that should also be 
exploited for efficient decision making: context- specific independence (CSI). For example, 
consider an agent responsible for building and maintaining a house, if the painting task can 
only be completed after the plumbing and the electrical wiring have been installed, then 
the probability that the painting is done is 0, in all contexts where plumbing or electricity 
are not done, independently of the agents action. The representation we have used so far in 
this paper would use a table to represent this type of function. This table is exponentially 
large in the number of variables in the scope of the function, and ignores the context-specific 
structure inherent in the problem definition. 

Boutilier et al. (Boutilier et al, 1995; Dearden & Boutilier, 1997; Boutilier, Dean, & 
Hanks, 1999; Boutilier et al, 2000) have developed a set of algorithms which can exploit CSI 
in the transition and reward models to perform efficient (approximate) planning. Although 
this approach is often successful in problems where the value function contains sufficient 
context-specific structure, the approach is not able to exploit the additive structure which 
is also often present in real- world problems. 

In this section, we extend the factored MDP model to include context-specific structure. 
We present a simple, yet effective extension of our algorithms which can exploit both CSI 
and additive structure to obtain efficient approximations for factored MDPs. We first extend 
the factored MDP representation to include context-specific structure and then show how 
the basic operations from Section 4 required by our algorithms can be performed efficiently 
in this new representation. 

8.1 Factored MDPs with Context-specific and Additive Structure 

There are several representations for context-specific functions. The most common are 
decision trees (Boutilier et al, 1995), algebraic decision diagrams (ADDs) (Hoey, St-Aubin, 
Hu, & Boutilier, 1999), and rules (Zhang & Poole, 1999). We choose to use rules as our 
basic representation, for two main reasons. First, the rule-based representation allows a 
fairly simple algorithm for variable elimination, which is a key operation in our framework. 
Second, rules are not required to be mutually exclusive and exhaustive, a requirement that 
can be restrictive if we want to exploit additive independence, where functions can be 
represented as a linear combination of a set of non- mutually exclusive functions. 

We begin by describing the rule-based representation (along the lines of Zhang and 
Poole's presentation (1999)) for the probabilistic transition model, in particular, the CPDs 
of our DBN model. Roughly speaking, each rule corresponds to some set of CPD entries 
that are all associated with a particular probability value. These entries with the same 
value are referred to as consistent contexts: 

Definition 8.1 Let C C {X, X'} and c E Dom(C). We say that c is consistent with 
b € Dom(B), for B C {X, X'}, if c and b have the same assignment for the variables in 
CnB. □ 

The probability of these consistent contexts will be represented by probability rules: 
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Electrical 
Naione / X Done 



P(Painting'f 




Done 



P(Painting'f P(Painting'f .95 

(a) 



Electrical 
Naione / X Done 




P(Puinting'# \_Plumbing 

Naione "jf "\.Done 

P(Palntlng ' # < ^Pat^2 > 
Notone^^^^"^V^p^one 

P(Painting'f P(Painting'<J.9 

(b) 



771 = (-lElectrical : 0} 

772 = (Electrical A -1 Plumbing : 0) 

773 = (Electrical A Plumbing : 0.95) 

(c) 



7?4 = (-lElectrical : 0} 
7/5 = (Electrical A -Plumbing : 0} 
7]6 = (Electrical A Plumbing A -Painting : 0} 
7/7 = (Electrical A Plumbing A Painting : 0.9) 

(d) 



Figure 11: Example CPDs for variable the Painting' = true represented as decision trees: 
(a) when the action is paint; (b) when the action is not paint. The same CPDs 
can be represented by probability rules as shown in (c) and (d), respectively. 



Definition 8.2 A probability rule 77 = (c : p) is a function 77 : {X, X'} 1— > [0, 1], where the 
context c E Dom(C) for C C {X,X'} and p G [0,1], such that 7/(x,x') = p if (x, x') is 
consistent with c and is equal to 1 otherwise. □ 

In this case, it is convenient to require that the rules be mutually exclusive and exhaus- 
tive, so that each CPD entry is uniquely defined by its association with a single rule. 

Definition 8.3 A rule-based conditional probability distribution (rule CPD) P a is a func- 
tion P a : ({X-} U X) ^ [0, 1], composed of a set of probability rules {771,772, . . . ,T/ m } whose 
contexts are mutually exclusive and exhaustive. We define: 

p a{x'i I x) = 77j(x,x'), 

where rjj is the unique rule in P a for which Cj is consistent with (a^,x). We require that, 
for all x, 

£P a (^|x) = l. □ 

We can define Parents a (X' i ) to be the union of the contexts of the rules in Pa(X' { | X). An 
example of a CPD represented by a set of probability rules is shown in Figure 11. 

Rules can also be used to represent additive functions, such as reward or basis functions. 
We represent such context specific value dependencies using value rules: 
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Definition 8.4 A value rule p = (c : v) is a function p : X i— > R st/c/i f/iai p(x) = t> w/ien 
x is consistent with c and otherwise. □ 

Note that a value rule (c : v) has a scope C. 

It is important to note that value rules are not required to be mutually exclusive and 
exhaustive. Each value rule represents a (weighted) indicator function, which takes on a 
value v in states consistent with some context c, and in all other states. In any given state, 
the values of the zero or more rules consistent with that state are simply added together. 

Example 8.5 In our construction example, we might have a set of rules: 



which, when summed together, define the reward function R = pi + p2 + p3 + Pa + • ■ ■ • □ 

In general, our reward function R a is represented as a rule-based function: 

Definition 8.6 A rule-based function / : X i— > R is composed of a set of rules {p\, . . . , p n } 
such that /(x) = Ya=i P*( x )- a 

In the same manner, each one of our basis functions hj is now represented as a rule-based 
function. 

This notion of a rule-based function is related to the tree-structure functions used by 
Boutilier et al. (2000), but is substantially more general. In the tree-structure value func- 
tions, the rules corresponding to the different leaves are mutually exclusive and exhaustive. 
Thus, the total number of different values represented in the tree is equal to the number 
of leaves (or rules). In the rule-based function representation, the rules are not mutually 
exclusive, and their values are added to form the overall function value for different settings 
of the variables. Different rules are added in different settings, and, in fact, with k rules, 
one can easily generate 2 k different possible values, as is demonstrated in Section 9. Thus, 
the rule-based functions can provide a compact representation for a much richer class of 
value functions. 

Using this rule-based representation, we can exploit both CSI and additive independence 
in the representation of our factored MDP and basis functions. We now show how the basic 
operations in Section 4 can be adapted to exploit our rule-based representation. 

8.2 Adding, Multiplying and Maximizing Consistent Rules 

In our table-based algorithms, we relied on standard sum and product operators applied to 
tables. In order to exploit CSI using a rule-based representation, we must redefine these 
standard operations. In particular, the algorithms will need to add or multiply rules that 
ascribe values to overlapping sets of states. 

We will start by defining these operations for rules with the same context: 



Pi 
P2 : 

P3 
P4 



(Plumbing = done : 100); 
(Electricity = done : 100); 
(Painting = done : 100) ; 
(Action = plumb : —10); 
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Definition 8.7 Let p± = (c : v±) and pi = (c : V2) be two rules with context c. Define the 
rule product as p\ x pi = (c : V\ ■ V2), and the rule sum as p\ + pi = (c : t>i + V2). □ 

Note that this definition is restricted to rules with the same context. We will address this 
issue in a moment. First, we will introduce an additional operation which maximizes a 
variable from a set of rules, which otherwise share a common context: 

Definition 8.8 Let Y be a variable with Dom[Y] = {yi, . . . ,yk}, and let pi, for each i = 
1, . . . , k, be a rule of the form pi = (c A Y = yi : Vi). Then for the rule-based function 
f = p± + • • • + pk, define the rule maximization over Y as maxy / = (c : maxj Vi) . □ 

After this operation, Y has been maximized out from the scope of the function /. 

These three operations we have just described can only be applied to sets of rules that 
satisfy very stringent conditions. To make our set of rules amenable to the application 
of these operations, we might need to refine some of these rules. We therefore define the 
following operation: 

Definition 8.9 Let p = (c : v) be a rule, and Y be a variable. Define the rule split 

Split(plY) of p on a variable Y as follows: If Y £ Scope[C], then Split(plY) = {p}; 
otherwise, 

Split(plY) = {(c A Y = y { : v) | y, L € Dom[Y]} . □ 

Thus, if we split a rule p on variable Y that is not in the scope of the context of p, then we 
generate a new set of rules, with one for each assignment in the domain of Y. 

In general, the purpose of rule splitting is to extend the context c of one rule p coincide 
with the context c' of another consistent rule p' . Naively, we might take all variables in 
Scope[C] — Scope[C] and split p recursively on each one of them. However, this process 
creates unnecessarily many rules: If Y is a variable in ScopefC] — Scope[C] and we split p 
on Y, then only one of the |Dom[Y]| new rules generated will remain consistent with p': the 
one which has the same assignment for Y as the one in c'. Thus, only this consistent rule 
needs to be split further. We can now define the recursive splitting procedure that achieves 
this more parsimonious representation: 

Definition 8.10 Let p = (c : v) be a rule, and b be a context such that b G Dom[B]. 
Define the recursive rule split Split(pLh) of p on a context b as follows: 

1. {p}, if c is not consistent with b; else, 

2. {p}, i/Scope[B] C Scope[C]; else, 

3. {Split(pilb) I pi G Split(pLY)}, for some variable Y € ScopefB] — Scope[C] . 
□ 

In this definition, each variable Y € Scope [B] — Scope [C] leads to the generation of k = 
|Dom(Y)| rules at the step in which it is split. However, only one of these k rules is used 
in the next recursive step because only one is consistent with b. Therefore, the size of the 
split set is simply 1 + X)yeScopc[B]-Scope[c](l-^ om (^)l — !)• This size is independent of the 
order in which the variables are split within the operation. 
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Note that only one of the rules in Split(pAh) is consistent with b: the one with context 
c A b. Thus, if we want to add two consistent rules p\ = {c\ : v±) and P2 = (C2 : ^2), then 
all we need to do is replace these rules by the set: 

Split(pilc2) U Split(p2lci), 

and then simply replace the resulting rules (ci A C2 : v\) and (C2 A ci : V2) by their sum 
(ci A C2 : v\ + 1>2). Multiplication is performed in an analogous manner. 

Example 8.11 Consider adding the following set of consistent rules: 

pi = {a A b : 5), 

p 2 = {a A -ic A d : 3). 

In these rules, the context c\ of pi is a A b, and the context C2 of p2 is a A —>c A d. 

Rules pi and P2 are consistent, therefore, we must split them to perform the addition 
operation: 

(a A b A c : 5), 
(aAbA^cA^d: 5), 
(a A b A -ic A d : 5). 



Split{pilc2) = < 

Likewise, 



Split(p 2 lci) = 
The result of adding rules p\ and p2 is 



(a A -16 A -ic A eZ : 3), 
(a A 6 A -ic A d : 3). 



(a A b A c : 5), 
(a A 6 A -ic A ->d : 5), 
(a Ab A -ic A d : 8), 
(a A -<b A -ic A d : 3). 



□ 



8.3 Rule-based One-step Lookahead 

Using this compact rule-based representation, we are able to compute a one-step lookahead 
plan efficiently for models with significant context-specific or additive independence. 

As in Section 4.1 for the table-based case, the rule-based Q a function can be represented 
as the sum of the reward function and the discounted expected value of the next state. 
Due to our linear approximation of the value function, the expectation term is, in turn, 
represented as the linear combination of the backprojections of our basis functions. To 
exploit CSI, we are representing the rewards and basis functions as rule-based functions. 
To represent Q a as a rule-based function, it is sufficient for us to show how to represent the 
backprojection gj of the basis function hj as a rule-based function. 

Each hj is a rule-based function, which can be written as hj(x) = X)« ( x ) ) where 
pf 1 ^ has the form ^c^ J ' 1 : vf 1 ^. Each rule is a restricted scope function; thus, we can 
simplify the backprojection as: 
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RuleBackproj a (p) , WHERE p IS GIVEN BY (c : v), WITH c e Dom[C]. 
Let g = {}. 

Select THE SET V OF RELEVANT PROBABILITY RULES: 

V = {rjj e P(X- I Parents(X[)) \ X[ E C and c is consistent with Cj}. 

Remove THE X' ASSIGNMENTS FROM THE CONTEXT OF ALL RULES IN V. 

// Multiply consistent rules: 

While THERE ARE TWO CONSISTENT RULES r/i = (ci : pi) AND TJ 2 = (c 2 : p 2 ): 
If Ci = C 2 , REPLACE THESE TWO RULES BY (ci : P1P2)] 

Else REPLACE THESE TWO RULES BY THE SET: Split (r)i Zc 2 ) U SpUt(r]2 /Ci) . 

// Generate value rules: 

For EACH RULE T]i IN P\ 

Update the backprojection g = g U {(cj : Piv)}. 
Return g. 



Figure 12: Rule-based backprojection. 
#(x) = ^P^x'lx^^x') ; 

x' 

= E^( x '|x)E/>f j) (x'); 

x' i 

= ££Pa(xMx)^V); 

i x' 

i 

where the term ^ hj ' ) P a (c^ J ' ) | x) can be written as a rule function. We denote this back- 
projection operation by RuleBackproj a (p[ hj ^) . 

The backprojection procedure, described in Figure 12, follows three steps. First, the 
relevant rules are selected: In the CPDs for the variables that appear in the context of p, 
we select the rules consistent with this context, as these are the only rules that play a role 
in the backprojection computation. Second, we multiply all consistent probability rules to 
form a local set of mutually-exclusive rules. This procedure is analogous to the addition 
procedure described in Section 8.2. Now that we have represented the probabilities that 
can affect p by a mutually-exclusive set, we can simply represent the backprojection of p 
by the product of these probabilities with the value of p. That is, the backprojection of p is 
a rule-based function with one rule for each one of the mutually-exclusive probability rules 
rji. The context of this new value rule is the same as that of rji, and the value is the product 
of the probability of rji and the value of p. 

Example 8.12 For example, consider the backprojection of a simple rule, 

p = { Painting = done : 100), 
through the CPD in Figure 11(c) for the paint action: 

RuleBackproj pSiint (p) = J2 p paint(x' | x)p(x'); 

x' 
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= X! Ppaiat( Paintin 9 I X.)p(Painting); 

Painting 1 

3 

= 100 ~^Yrn(P ainting' = done,x) . 
i=i 

Note that the product of these simple rules is equivalent to the decision tree CPD shown in 
Figure 11(a). Hence, this product is equal to in most contexts, for example, when electricity 
is not done at time t. The product in non-zero only in one context: in the context associated 
with rule 773. Thus, we can express the result of the backprojection operation by a rule-based 
function with a single rule: 

RuleBackproj pSL ^ nt (p) = (Plumbing A Electrical : 95). 

Similarly, the backprojection of p when the action is not paint can also be represented by a 
single rule: 

RuleBackproj^pai^lp) = {Plumbing A Electrical A Painting : 90). □ 

Using this algorithm, we can now write the backprojection of the rule-based basis func- 
tion hj as: 

g] (x) = RuleBackpro ]a {pf > } ) , (23) 

i 

where g® is a sum of rule-based functions, and therefore also a rule-based function. For 
simplicity of notation, we use g^ = RuleBackproj a (hj) to refer to this definition of backpro- 
jection. Using this notation, we can write Q a ( x ) = -R a (x) + jJ2j w j9j ( x )> which is again a 
rule-based function. 

8.4 Rule-based Maximization Over the State Space 

The second key operation required to extend our planning algorithms to exploit CSI is to 
modify the variable elimination algorithm in Section 4.2.1 to handle the rule-based rep- 
resentation. In Section 4.2.1, we showed that the maximization of a linear combination 
of table-based functions with restricted scope can be performed efficiently using non-serial 
dynamic programming (Bertele & Brioschi, 1972), or variable elimination. To exploit struc- 
ture in rules, we use an algorithm similar to variable elimination in a Bayesian network with 
context-specific independence (Zhang & Poole, 1999). 

Intuitively, the algorithm operates by selecting the value rules relevant to the variable 
being maximized in the current iteration. Then, a local maximization is performed over 
this subset of the rules, generating a new set of rules without the current variable. The 
procedure is then repeated recursively until all variables have been eliminated. 

More precisely, our algorithm "eliminates" variables one by one, where the elimina- 
tion process performs a maximization step over the variable's domain. Suppose that we 
are eliminating JQ, whose collected value rules lead to a rule function /, and / involves 
additional variables in some set B, so that /'s scope is B U {Aj}. We need to compute 
the maximum value for Aj for each choice of b € Dom[B]. We use MaxOut (/, Xj) to de- 
note a procedure that takes a rule function /(B, Xj) and returns a rule function <?(B) such 
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MaxOut (f, B) 
Let g = {}. 

Add COMPLETING RULES TO /: (B = bi : 0), i = 1, . . . , k. 
// Summing consistent rules: 

While THERE ARE TWO CONSISTENT RULES pi = (ci : V\) AND p 2 = (c 2 : V 2 )\ 
If Ci = C 2 , THEN REPLACE THESE TWO RULES BY (ci : V\ + V 2 ); 
Else REPLACE THESE TWO RULES BY THE SET: SpUt(pilc 2 ) U Split (p 2 Lc\ ) . 

// Maximizing out variable B: 
Repeat until / IS empty: 

If there are rules (cAJ3 = ft,: V6j £ Dom(B) : 

Then REMOVE THESE RULES FROM / AND ADD RULE (c : maXj Vi) TO g; 

Else select two rules: pi = (c; A B — bi : Vi) and pj = (cj A B = bj : Vj) 

SUCH THAT Ci IS CONSISTENT WITH Cj, BUT NOT IDENTICAL, AND REPLACE 
THEM WITH SpUt(pilCj)U SpUt(pjlCi) . 

Return g. 

Figure 13: Maximizing out variable B from rule function /. 

that: g(h) = max^ /(b, x{). Such a procedure is an extension of the variable elimination 
algorithm of Zhang and Poole (Zhang & Poole, 1999). 

The rule-based variable elimination algorithm maintains a set T of value rules, initially 
containing the set of rules to be maximized. The algorithm then repeats the following steps 
for each variable Xj until all variables have been eliminated: 

1. Collect all rules which depend on Xj into /j — ft = {(c : v) € T \ X; L £ C} — and 
remove these rules from T . 

2. Perform the local maximization step over X: = MaxOut (/j, Xj); 

3. Add the rules in gi to T\ now, Xj has been "eliminated." 

The cost of this algorithm is polynomial in the number of new rules generated in the 
maximization operation MaxOut (/j, Xj). The number of rules is never larger and in many 
cases exponentially smaller than the complexity bounds on the table-based maximization in 
Section 4.2.1, which, in turn, was exponential only in the induced width of the cost network 
graph (Dechter, 1999). However, the computational costs involved in managing sets of rules 
usually imply that the computational advantage of the rule-based approach over the table- 
based one will only be significant in problems that possess a fair amount of context-specific 
structure. 

In the remainder of this section, we present the algorithm for computing the local 
maximization MaxOut (/j, Xj). In the next section, we show how these ideas can be applied 
to extending the algorithm in Section 4.2.2 to exploit CSI in the LP representation for 
planning in factored MDPs. 

The procedure, presented in Figure 13, is divided into two parts: first, all consistent 
rules are added together as described in Section 8.2; then, variable B is maximized. This 
maximization is performed by generating a set of rules, one for each assignment of B, whose 
contexts have the same assignment for all variables except for B, as in Definition 8.8. This 
set is then substituted by a single rule without a B assignment in its context and with value 
equal to the maximum of the values of the rules in the original set. Note that, to simplify 
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the algorithm, we initially need to add a set of value rules with value, which guarantee 
that our rule function / is complete (i.e., there is at least one rule consistent with every 
context). 

The correctness of this procedure follows directly from the correctness of the rule-based 
variable elimination procedure described by Zhang and Poole, merely by replacing summa- 
tions with product with max, and products with products with sums. We conclude this 
section with a small example to illustrate the algorithm: 

Example 8.13 Suppose we are maximizing a for the following set of rules: 

pi = (-.a : 1), 
p 2 = (a A -16 : 2), 
p 3 = (a A b A -ic : 3) , 
P4 = {->a A b : 1). 

When we add completing rules, we get: 

p 5 = (-.a : 0), 
p 6 = (a : 0). 

In the first part of the algorithm, we need to add consistent rules: We add p§ to p\ (which 
remains unchanged), combine p\ with p^, p§ with p2, and then the split of p% on the context 
°f P3> t° 9 e t the following inconsistent set of rules: 

P2 = (a A -6: 2), 
P3 = (a A b A ->c : 3) , 

pj = (— >a A b : 2), (from adding p4 to the consistent rule from Split(pilb)) 

PS = {->a A -16 : 1), (from Split(pilh)) 

Pq = (a A b A c : 0), (from Split(p 6 la A & A -ic) ). 

Note that several rules with value are also generated, but not shown here because they are 
added to other rules with consistent contexts. We can move to the second stage (repeat loop) 
of MaxOut. We remove p2, and ps, and maximize a out of them, to give: 

Pio = (-6:2). 

We then select rules P3 and p-j and split p-j on c (p% is split on the empty set and is not 
changed), 

Pu = (->a A b A c : 2), 
P12 = (->a A b A -ic : 2). 
Maximizing out a from rules p±2 and ps, we get: 

P13 = (b A -ic : 3). 

We are left with p\\, which maximized over its counterpart pg gives 

P12 = (b A -ic : 2). 

Notice that, throughout this maximization, we have not split on the variable C when -16 G Cj, 
giving us only 6 distinct rules in the final result. This is not possible in a table-based 
representation, since our functions would then be over the 3 variables a,b,c, and therefore 
must have 8 entries. □ 
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8.5 Rule-based Factored LP 

In Section 4.2.2, we showed that the LPs used in our algorithms have exponentially many 
constraints of the form: <f> > Yli w i c «( x ) — b(x),\/x, which can be substituted by a single, 
equivalent, non-linear constraint: (ft > max x J2i w i q( x ) — K x )- We then showed that, using 
variable elimination, we can represent this non-linear constraint by an equivalent set of 
linear constraints in a construction we called the factored LP. The number of constraints in 
the factored LP is linear in the size of the largest table generated in the variable elimination 
procedure. This table-based algorithm can only exploit additive independence. We now 
extend the algorithm in Section 4.2.2 to exploit both additive and context-specific structure, 
by using the rule-based variable elimination described in the previous section. 

Suppose we wish to enforce the more general constraint > max y i ?w (y), where -F w (y) = 
J2j fj'iy) suc h that each fj is a rule. As in the table-based version, the superscript w means 
that fj might depend on w. Specifically, if fj comes from basis function hi, it is multiplied 
by the weight wf, if fj is a rule from the reward function, it is not. 

In our rule-based factored linear program, we generate LP variables associated with 
contexts; we call these LP rules. An LP rule has the form (c : u); it is associated with a 
context c and a variable u in the linear program. We begin by transforming all our original 
rules into LP rules as follows: If rule fj has the form (cj : Vj) and comes from basis 
function hi, we introduce an LP rule Bj = (cj : Uj) and the equality constraint Uj = WiVj. 
If fj has the same form but comes from a reward function, we introduce an LP rule of the 
same form, but the equality constraint becomes Uj = Vj. 

Now, we have only LP rules and need to represent the constraint: > max y J2j e j(y)- 
To represent such a constraint, we follow an algorithm very similar to the variable elimina- 
tion procedure in Section 8.4. The main difference occurs in the MaxOut (/, B) operation in 
Figure 13. Instead of generating new value rules, we generate new LP rules, with associated 
new variables and new constraints. The simplest case occurs when computing a split or 
adding two LP rules. For example, when we add two value rules in the original algorithm, 
we instead perform the following operation on their associated LP rules: If the LP rules 
are (c : Ui) and (c : Uj), we replace these by a new rule (c : Uk), associated with a new LP 
variable with context c, whose value should be Ui + Uj. To enforce this value constraint, 
we simply add an additional constraint to the LP: ut = Ui + Uj. A similar procedure can 
be followed when computing the split. 

More interesting constraints are generated when we perform a maximization. In the 
rule-based variable elimination algorithm in Figure 13, this maximization occurs when we 
replace a set of rules: 

(cAB = bi: Vi),Vbi € Bom(B), 

by a new rule 

c : max^y 

Following the same process as in the LP rule summation above, if we are maximizing 

= (c AB = bi : iti),V6j € Dom(S), 

we generate a new LP variable ut associated with the rule = (c : Uk)- However, we 
cannot add the nonlinear constraint = maxj Ui , but we can add a set of equivalent linear 
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constraints 

Uk > Ui, Vz. 

Therefore, using these simple operations, we can exploit structure in the rule functions 
to represent the nonlinear constraint e n > max y J2j e j (y ) j where e n is the very last LP 
rule we generate. A final constraint u n = 4> implies that we are representing exactly the 
constraints in Equation (12), without having to enumerate every state. 

The correctness of our rule-based factored LP construction is a corollary of Theorem 4.4 
and of the correctness of the rule-based variable elimination algorithm (Zhang & Poole, 
1999) . 

Corollary 8.14 The constraints generated by the rule-based factored LP construction are 
equivalent to the non-linear constraint in Equation (12). That is, an assignment to (</>, w) 
satisfies the rule-based factored LP constraints if and only if it satisfies the constraint in 
Equation (12). □ 

The number of variables and constraints in the rule-based factored LP is linear in the 
number of rules generated by the variable elimination process. In turn, the number of rules 
is no larger, and often exponentially smaller, than the number of entries in the table-based 
approach. 

To illustrate the generation of LP constraints as just described, we now present a small 
example: 

Example 8.15 Let e\, e<i, and e^ be the set of LP rules which depend on the variable 
b being maximized. Here, rule ei is associated with the LP variable U{: 

e\ = (a A b : u±), 
e2 = (a A b A c : U2), 
e 3 = {a A -.6 : u 3 ), 
e4 = {a A b A ->c : U4). 

In this set, note that rules e\ and e2 are consistent. We combine them to generate the 
following rules: 

65 = {a A b A c : 1x5), 
e$ = {a A b A ->c : ui). 

and the constraint u\ + U2 = u$. Similarly, e§ and e^ may be combined, resulting in: 

e-j = {a A b A ->c : uq). 

with the constraint u% = u\ + U4. Now, we have the following three inconsistent rules for 
the maximization: 

e 3 = (a A ->b : n 3 ), 
65 = {a A b A c : «s), 
ej = {a A b A ->c : tie). 

Following the maximization procedure, since no pair of rules can be eliminated right away, 
we split e 3 and e$ to generate the following rules: 

es = {a A -16 A c : u 3 ), 
eg = (a A -16 A ->c : ii 3 ), 
es = (a A b A c : U5). 
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We can now maximize b out from eg and e§, resulting in the following rule and constraints 
respectively: 

e w = {aAc: u 7 ), 
u 7 > u 5 , 

Likewise, maximizing b out from eg and e§, we get: 

en = (oA-ic: u 8 ), 
us > «3, 
m > u 6 ; 

which completes the elimination of variable b in our rule-based factored LP. □ 

We have presented an algorithm for exploiting both additive and context-specific struc- 
ture in the LP construction steps of our planning algorithms. This rule-based factored LP 
approach can now be applied directly in our approximate linear programming and approx- 
imate policy iteration algorithms, which were presented in Sections 5 and 6. 

The only additional modification required concerns the manipulation of the decision 
list policies presented in Section 6.2. Although approximate linear programming does not 
require any explicit policy representation (or the default action model), approximate pol- 
icy iteration require us to represent such policy. Fortunately, no major modifications are 
required in the rule-based case. In particular, the conditionals (tj,aj,#i) in the decision 
list policies are already context-specific rules. Thus, the policy representation algorithm in 
Section 6.2 can be applied directly with our new rule-based representation. Therefore, we 
now have a complete framework for exploiting both additive and context-specific structure 
for efficient planning in factored MDPs. 

9. Experimental Results 

The factored representation of a value function is most appropriate in certain types of 
systems: Systems that involve many variables, but where the strong interactions between 
the variables are fairly sparse, so that the decoupling of the influence between variables 
does not induce an unacceptable loss in accuracy. As argued by Herbert Simon (1981) 
in "Architecture of Complexity," many complex systems have a "nearly decomposable, 
hierarchical structure," with the subsystems interacting only weakly between themselves. To 
evaluate our algorithm, we selected problems that we believe exhibit this type of structure. 

In this section, we perform various experiments intended to explore the performance 
of our algorithms. First, we compare our factored approximate linear programming (LP) 
and approximate policy iteration (PI) algorithms. We also compare to the /^-projection 
algorithm of Koller and Parr (2000). Our second evaluation compares a table-based im- 
plementation to a rule-based implementation that can exploit CSI. Finally, we present 
comparisons between our approach and the algorithms of Boutilier et al. (2000). 

9.1 Approximate LP and Approximate PI 

In order to compare our approximate LP and approximate PI algorithms, we tested both on 
the SysAdmin problem described in detail in Section 2.1. This problem relates to a system 
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administrator who has to maintain a network of computers; we experimented with various 
network architectures, shown in Figure 1. Machines fail randomly, and a faulty machine 
increases the probability that its neighboring machines will fail. At every time step, the 
SysAdmin can go to one machine and reboot it, causing it to be working in the next time 
step with high probability. Recall that the state space in this problem grows exponentially 
in the number of machines in the network, that is, a problem with m machines has 2 m states. 
Each machine receives a reward of 1 when working (except in the ring, where one machine 
receives a reward of 2, to introduce some asymmetry), a zero reward is given to faulty 
machines, and the discount factor is 7 = 0.95. The optimal strategy for rebooting machines 
will depend upon the topology, the discount factor, and the status of the machines in the 
network. If machine i and machine j are both faulty, the benefit of rebooting i must be 
weighed against the expected discounted impact of delaying rebooting j on j's successors. 
For topologies such as rings, this policy may be a function of the status of every single 
machine in the network. 

The basis functions used included independent indicators for each machine, with value 
1 if it is working and zero otherwise (i.e., each one is a restricted scope function of a single 
variable), and the constant basis, whose value is 1 for all states. We selected straightforward 
variable elimination orders: for the "Star" and "Three Legs" topologies, we first eliminated 
the variables corresponding to computers in the legs, and the center computer (server) was 
eliminated last; for "Ring," we started with an arbitrary computer and followed the ring 
order; for "Ring and Star," the ring machines were eliminated first and then the center one; 
finally, for the "Ring of Rings" topology, we eliminated the computers in the outer rings 
first and then the ones in the inner ring. 

We implemented the factored policy iteration and linear programming algorithms in 
Matlab, using CPLEX as the LP solver. Experiments were performed on a Sun UltraSPARC- 
II, 359 MHz with 256MB of RAM. To evaluate the complexity of the approximate policy 
iteration with max-norm projection algorithm, tests were performed with increasing the 
number of states, that is, increasing number of machines on the network. Figure 14 shows 
the running time for increasing problem sizes, for various architectures. The simplest one 
is the "Star," where the backprojection of each basis function has scope restricted to two 
variables and the largest factor in the cost network has scope restricted to two variables. 
The most difficult one was the "Bidirectional Ring," where factors contain five variables. 

Note that the number of states is growing exponentially (indicated by the log scale in 
Figure 14), but running times increase only logarithmically in the number of states, or 
polynomially in the number of variables. We illustrate this behavior in Figure 14(d), where 
we fit a 3rd order polynomial to the running times for the "unidirectional ring." Note that 
the size of the problem description grows quadratically with the number of variables: adding 
a machine to the network also adds the possible action of fixing that machine. For this 
problem, the computation cost of our factored algorithm empirically grows approximately 
as O ((n ■ |A|) 15 ), for a problem with n variables, as opposed to the exponential complexity 
- -poly (2™, \ A\) — of the explicit algorithm. 

For further evaluation, we measured the error in our approximate value function relative 
to the true optimal value function V*. Note that it is only possible to compute V* for small 
problems; in our case, we were only able to go up to 10 machines. For comparison, we 
also evaluated the error in the approximate value function produced by the /^-projection 
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Figure 14: (a)-(c) Running times for policy iteration with max-norm projection on variants 
of the SysAdmin problem; (d) Fitting a polynomial to the running time for the 
"Ring" topology. 



algorithm of Koller and Parr (2000). As we discussed in Section 6.4, the £2 projections in 
factored MDPs by Koller and Parr are difficult and time consuming; hence, we were only 
able to compare the two algorithms for smaller problems, where an equivalent /^-projection 
can be implemented using an explicit state space formulation. Results for both algorithms 
are presented in Figure 15(a), showing the relative error of the approximate solutions to 
the true value function for increasing problem sizes. The results indicate that, for larger 
problems, the max-norm formulation generates a better approximation of the true optimal 
value function V* than the /^-projection. Here, we used two types of basis functions: the 
same single variable functions, and pairwise basis functions. The pairwise basis functions 
contain indicators for neighboring pairs of machines (i.e., functions of two variables). As 
expected, the use of pairwise basis functions resulted in better approximations. 
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Figure 15: (a) Relative error to optimal value function V* and comparison to £2 projection 
for "Ring" ; (b) For large models, measuring Bellman error after convergence. 



For these small problems, we can also compare the actual value of the policy generated 
by our algorithm to the value of the optimal policy. Here, the value of the policy generated 
by our algorithm is much closer to the value of the optimal policy than the error implied by 
the difference between our approximate value function and V*. For example, for the "Star" 
architecture with one server and up to 6 clients, our approximation with single variable 
basis functions had relative error of 12%, but the policy we generated had the same value 
as the optimal policy. In this case, the same was true for the policy generated by the £2 
projection. In a "Unidirectional Ring" with 8 machines and pairwise basis, the relative 
error between our approximation and V* was about 10%, but the resulting policy only had 
a 6% loss over the optimal policy. For the same problem, the £2 approximation has a value 
function error of 12%, and a true policy loss was 9%. In other words, both methods induce 
policies that have lower errors than the errors in the approximate value function (at least 
for small problems). However, our algorithm continues to outperform the £2 algorithm, 
even with respect to actual policy loss. 

For large models, we can no longer compute the correct value function, so we cannot 
evaluate our results by computing ||V* — Hwf^. Fortunately, as discussed in Section 7, 
the Bellman error can be used to provide a bound on the approximation error and can be 
computed efficiently by exploiting problem-specific structure. Figure 15(b) shows that the 
Bellman error increases very slowly with the number of states. 

It is also valuable to look at the actual decision-list policies generated in our experiments. 
First, we noted that the lists tended to be short, the length of the final decision list policy 
grew approximately linearly with the number of machines. Furthermore, the policy itself 
is often fairly intuitive. In the "Ring and Star" architecture, for example, the decision list 
says: If the server is faulty, fix the server; else, if another machine is faulty, fix it. 

Thus far, we have presented scaling results for running times and approximation error for 
our approximate PI approach. We now compare this algorithm to the simpler approximate 
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LP approach of Section 5. As shown in Figure 16(a), the approximate LP algorithm for 
factored MDPs is significantly faster than the approximate PI algorithm. In fact, approxi- 
mate PI with single-variable basis functions variables is more costly computationally than 
the LP approach using basis functions over consecutive triples of variables. As shown in 
Figure 16(b), for singleton basis functions, the approximate PI policy obtains slightly better 
performance for some problem sizes. However, as we increase the number of basis functions 
for the approximate LP formulation, the value of the resulting policy is much better. Thus, 
in this problem, our factored approximate linear programming formulation allows us to use 
more basis functions and to obtain a resulting policy of higher value, while still maintaining 
a faster running time. These results, along with the simpler implementation, suggest that 
in practice one may first try to apply the approximate linear programming algorithm before 
deciding to move to the more elaborate approximate policy iteration approach. 

9.2 Comparing Table-based and Rule-based Implementations 

Our next evaluation compares a table-based representation, which exploits only additive 
independence, to the rule-based representation presented in Section 8, which can exploit 
both additive and context-specific independence. For these experiments, we implemented 
our factored approximate linear programming algorithm with table-based and rule-based 
representations in C++, using CPLEX as the LP solver. Experiments were performed on 
a Sun UltraSPARC-II, 400 MHz with 1GB of RAM. 

To evaluate and compare the algorithms, we utilized a more complex extension of the 
SysAdmin problem. This problem, dubbed the Process-SysAdmin problem, contains three 
state variables for each machine i in the network: Loadi, Statusi and Selector \. Each com- 
puter runs processes and receives rewards when the processes terminate. These processes 
are represented by the Loadi variable, which takes values in {Idle, Loaded, Success}, and the 
computer receives a reward when the assignment of Loadi is Success. The Statusi variable, 
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Figure 17: Running time for Process- Sys Admin problem for various topologies: (a) "Star"; 
(b) "Ring"; (c) "Reverse star" (with fit function). 
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Figure 18: Fraction of total running time spent in CPLEX for the Process- Sys Admin prob- 
lem with a "Ring" topology. 



representing the status of machine i, takes values in {Good, Faulty, Dead}; if its value is 
Faulty, then processes have a smaller probability of terminating and if its value is Dead, 
then any running process is lost and Loadi becomes Idle. The status of machine i can be- 
come Faulty and eventually Dead at random; however, if machine i receives a packet from 
a dead machine, then the probability that Statusi becomes Faulty and then Dead increases. 
The Selector variable represents this communication by selecting one of the neighbors of i 
uniformly at random at every time step. The Sys Admin can select at most one computer 
to reboot at every time step. If computer i is rebooted, then its status becomes Good 
with probability 1, but any running process is lost, i.e., the Loadi variable becomes Idle. 
Thus, in this problem, the SysAdmin must balance several conflicting goals: rebooting a 
machine kills processes, but not rebooting a machine may cause cascading faults in network. 
Furthermore, the SysAdmin can only choose one machine to reboot, which imposes the ad- 
ditional tradeoff of selecting only one of the (potentially many) faulty or dead machines in 
the network to reboot. 

We experimented with two types of basis functions: "single+" includes indicators over 
all of the joint assignments of Loadi, Statusi and S elector i, and "pair" which, in addition, 
includes a set of indicators over Statusi, Status j, and S 'elector \ = j, for each neighbor j 
of machine i in the network. The discount factor was 7 = 0.95. The variable elimination 
order eliminated all of the Loadi variables first, and then followed the same patterns as in 
the simple SysAdmin problem, eliminating first Statusi and then Selector when machine i 
is eliminated. 

Figure 17 compares the running times for the table-based implementation to the ones 
for the rule-based representation for three topologies: "Star," "Ring," and "Reverse star." 
The "Reverse star" topology reverses the direction of the influences in the "Star": rather 
than the central machine influencing all machines in the topology, all machines influence 
the central one. These three topologies demonstrate three different levels of CSI: In the 
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"Star" topology, the factors generated by variable elimination are small. Thus, although the 
running times are polynomial in the number of state variables for both methods, the table- 
based representation is significantly faster than the rule-based one, due to the overhead of 
managing the rules. The "Ring" topology illustrates an intermediate behavior: "single+" 
basis functions induce relatively small variable elimination factors, thus the table-based 
approach is faster. However, with "pair" basis the factors are larger and the rule-based 
approach starts to demonstrate faster running times in larger problems. Finally, the "Re- 
verse star" topology represents the worst-case scenario for the table-based approach. Here, 
the scope of the backprojection of a basis function for the central machine will involve all 
computers in the network, as all machines can potentially influence the central one in the 
next time step. Thus, the size of the factors in the table-based variable elimination ap- 
proach are exponential in the number of machines in the network, which is illustrated by 
the exponential growth in Figure 17(c). The rule-based approach can exploit the CSI in this 
problem; for example, the status of the central machine Status® only depends on machine 
j if the value selector is j, i.e., if Selector® = j. By exploiting CSI, we can solve the same 
problem in polynomial time in the number of state variables, as seen in the second curve in 
Figure 17(c). 

It is also instructive to compare the portion of the total running time spent in CPLEX 
for the table-based as compared to the rule-based approach. Figure 18 illustrates this 
comparison. Note that amount of time spent in CPLEX is significantly higher for the 
table-based approach. There are two reasons for this difference: first, due to CSI, the LPs 
generated by the rule-based approach are smaller than the table-based ones; second, rule- 
based variable elimination is more complex than the table-based one, due to the overhead 
introduced by rule management. Interestingly, the proportion of CPLEX time increases as 
the problem size increases, indicating that the asymptotic complexity of the LP solution is 
higher than that of variable elimination, thus suggesting that, for larger problems, additional 
large-scale LP optimization procedures, such as constraint generation, may be helpful. 

9.3 Comparison to Apricodd 

The most closely related work to ours is a line of research that began with the work of 
Boutilier et al. (1995). In particular, the approximate Apricodd algorithm of Hoey et 
al. (1999), which uses analytic decision diagrams (ADDs) to represent the value function 
is a strong alternative approach for solving factored MDPs. As discussed in detail in Sec- 
tion 10, the Apricodd algorithm can successfully exploit context-specific structure in the 
value function, by representing it with the set of mutually-exclusive and exhaustive branches 
of the ADD. On the other hand, our approach can exploit both additive and context-specific 
structure in the problem, by using a linear combination of non-mutually-exclusive rules. To 
better understand this difference, we evaluated both our rule-based approximate linear 
programming algorithm and Apricodd in two problems, Linear and Expon, designed by 
Boutilier et al. (2000) to illustrate respectively the best-case and the worst-case behavior 
of their algorithm. In these experiments, we used the web-distributed version of Apri- 
codd (Hoey, St-Aubin, Hu, & Boutilier, 2002), running it locally on a Linux Pentium III 
700MHz with 1GB of RAM. 
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Figure 19: Comparing Apricodd to rule-based approximate linear programming on the (a) 
Linear and (b) Expon problems. 



These two problems involve n binary variables X\ , . . . , X n and n deterministic actions 
d, . . . ,a n . The reward is 1 when all variables Xk are true, and is otherwise. The problem 
is discounted by a factor 7 = 0.99. The difference between the Linear and the Expon 
problems is in the transition probabilities. In the Linear problem, the action sets the 
variable Xk to true and makes all succeeding variables, X{ for i > k, false. If the state space 
of the Linear problem is seen as a binary number, the optimal policy is to set repeatedly the 
largest bit (A& variable) which has all preceding bits set to true. Using an ADD, the optimal 
value function for this problem can be represented in linear space, with n+1 leaves (Boutilier 
et ah, 2000). This is the "best-case" for Apricodd, and the algorithm can compute this value 
function quite efficiently. Figure 19(a) compares the running time of Apricodd to that of 
one of our algorithms with indicator basis functions between pairs of consecutive variables. 
Note that both algorithms obtain the same policy in polynomial time in the number of 
variables. However, in such structured problems, the efficient implementation of the ADD 
package used in Apricodd makes it faster in this problem. 

On the other hand, the Expon problem illustrates the worst-case for Apricodd. In this 
problem, the action sets the variable X^ to true, if all preceding variables, Xi for i < k, are 
true, and it makes all preceding variables false. If the state space is seen as a binary number, 
the optimal policy goes through all binary numbers in sequence, by repeatedly setting the 
largest bit {X^ variable) which has all preceding bits set to true. Due to discounting, the 
optimal value function assigns a value of ^y 2 "^^ 1 to the jth binary number, so that the 
value function contains exponentially many different values. Using an ADD, the optimal 
value function for this problem requires an exponential number of leaves (Boutilier et al., 
2000), which is illustrated by the exponential running time in Figure 19(b). However, 
the same value function can be approximated very compactly as a factored linear value 
function using n + 1 basis functions: an indicator over each variable X^ and the constant 
base. As shown in Figure 19(b), using this representation, our factored approximate linear 
programming algorithm computes the value function in polynomial time. Furthermore, the 
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Figure 20: Comparing Apricodd to rule-based approximate linear programming with "sin- 
gle+" basis functions on the Process- Sys Admin problem with "Ring" topology 
(a) running time and (b) value of the resulting policy; and with "Star" topology 
(c) running time and (d) value of the resulting policy. 



policy obtained by our approach was optimal for this problem. Thus, in this problem, the 
ability to exploit additive independence allows an efficient polynomial time solution. 

We have also compared Apricodd to our rule-based approximate linear programming 
algorithm on the Process-SysAdmin problem. This problem has significant additive struc- 
ture in the reward function and factorization in the transition model. Although this type of 
structure is not exploited directly by Apricodd, the ADD approximation steps performed by 
the algorithm can, in principle, allow Apricodd to find approximate solutions to the prob- 
lem. We spent a significant amount of time attempting to find the best set of parameters 
for Apricodd for these problems. We settled on the "sift" method of variable reordering 
and the "round" approximation method with the "size" (maximum ADD size) criteria. To 

4. We are very grateful to Jesse Hoey and Robert St-Aubin for their assistance in selecting the parameters. 
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allow the value function representation to scale with the problem size, we set the maximum 
ADD size to 4000 + 400n for a network with n machines. (We experimented with a variety 
of different growth rates for the maximum ADD size; here, as for the other parameters, 
we selected the choice that gave the best results for Apricodd.) We compared Apricodd 
with these parameters to our rule-based approximate linear programming algorithm with 
"single+" basis functions on a Pentium III 700MHz with 1GB of RAM. These results are 
summarized in Figure 20. 

On very small problems (up to 4-5 machines), the performance of the two algorithms is 
fairly similar in terms of both the running time and the quality of the policies generated. 
However, as the problem size grows, the running time of Apricodd increases rapidly, and 
becomes significantly higher than that of our algorithm . Furthermore, as the problem size 
increases, the quality of the policies generated by Apricodd also deteriorates. This difference 
in policy quality is caused by the different value function representation used by the two 
algorithms. The ADDs used in Apricodd represent k different values with k leaves; thus, 
they are forced to agglomerate many different states and represent them using a single value. 
For smaller problems, such agglomeration can still represent good policies. Unfortunately, 
as the problem size increases and the state space grows exponentially, Apricodd's policy 
representation becomes inadequate, and the quality of the policies decreases. On the other 
hand, our linear value functions can represent exponentially many values with only k basis 
functions, which allows our approach to scale up to significantly larger problems. 

10. Related Work 

The most closely related work to ours is a line of research that began with the work of 
Boutilier et al. (1995). We address this comparison separately below, but we begin this 
section with some broader background references. 

10.1 Approximate MDP Solutions 

The field of MDPs, as it is popularly known, was formalized by Bellman (1957) in the 
1950's. The importance of value function approximation was recognized at an early stage 
by Bellman himself (1963). In the early 1990's the MDP framework was recognized by AI 
researchers as a formal framework that could be used to address the problem of planning 
under uncertainty (Dean, Kaelbling, Kirman, & Nicholson, 1993). 

Within the AI community, value function approximation developed concomitantly with 
the notion of value function representations for Markov chains. Sutton's seminal paper on 
temporal difference learning (1988), which addressed the use of value functions for prediction 
but not planning, assumed a very general representation of the value function and noted 
the connection to general function approximators such as neural networks. However, the 
stability of this combination was not directly addressed at that time. 

Several important developments gave the AI community deeper insight into the rela- 
tionship between function approximation and dynamic programming. Tsitsiklis and Van 
Roy (1996a) and, independently, Gordon (1995) popularized the analysis of approximate 
MDP methods via the contraction properties of the dynamic programming operator and 
function approximator. Tsitsiklis and Van Roy (1996b) later established a general con- 
vergence result for linear value function approximators and TD(X), and Bertsekas and 
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Tsitsiklis (1996) unified a large body of work on approximate dynamic programming under 
the name of Neuro- dynamic Programming, also providing many novel and general error 
analyses. 

Approximate linear programming for MDPs using linear value function approximation 
was introduced by Schweitzer and Seidmann (1985), although the approach was somewhat 
deprecated until fairly recently due the lack of compelling error analyses and the lack of an 
effective method for handling the large number of constraints. Recent work by de Farias 
and Van Roy (2001a, 2001b) has started to address these concerns with new error bounds 
and constraint sampling methods. Our approach, rather than sampling constraints, utilizes 
structure in the model and value function to represent all of the constraints compactly. 

10.2 Factored Approaches 

Tatman and Shachter (1990) considered the additive decomposition of value nodes in influ- 
ence diagrams. A number of approaches to factoring of general MDPs have been explored in 
the literature. Techniques for exploiting reward functions that decompose additively were 
studied by Meuleau et al. (1998), and by Singh and Cohn (1998). 

The use of factored representations such as dynamic Bayesian networks was pioneered 
by Boutilier et al. (1995) and has developed steadily in recent years. These methods rely 
on the use of context-specific structures such as decision trees or analytic decision diagrams 
(ADDs) (Hoey et al, 1999) to represent both the transition dynamics of the DBN and 
the value function. The algorithms use dynamic programming to partition the state space, 
representing the partition using a tree-like structure that branches on state variables and 
assigns values at the leaves. The tree is grown dynamically as part of the dynamic pro- 
gramming process and the algorithm creates new leaves as needed: A leaf is split by the 
application of a DP operator when two states associated with that leaf turn out to have 
different values in the backprojected value function. This process can also be interpreted 
as a form of model minimization (Dean & Givan, 1997). 

The number of leaves in a tree used to represent a value function determines the compu- 
tational complexity of the algorithm. It also limits the number of distinct values that can 
be assigned to states: since the leaves represent a partitioning of the state space, every state 
maps to exactly one leaf. However, as was recognized early on, there are trivial MDPs which 
require exponentially large value functions. This observation led to a line of approximation 
algorithms aimed at limiting the tree size (Boutilier & Dearden, 1996) and, later, limiting 
the ADD size (St-Aubin, Hoey, & Boutilier, 2001). Kim and Dean (2001) also explored 
techniques for discovering tree-structured value functions for factored MDPs. While these 
methods permit good approximate solutions to some large MDPs, their complexity is still 
determined by the number of leaves in the representation and the number of distinct values 
than can be assigned to states is still limited as well. 

Tadepalli and Ok (1996) were the first to apply linear value function approximation 
to Factored MDPs. Linear value function approximation is a potentially more expressive 
approximation method because it can assign unique values to every state in an MDP without 
requiring storage space that is exponential in the number of state variables. The expressive 
power of a tree with k leaves can be captured by a linear function approximator with k basis 
functions such that basis function hi is an indicator function that tests if a state belongs 
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in the partition of leaf i. Thus, the set of value functions that can be represented by a 
tree with k leaves is a subset of the set of value functions that can be represented by a 
value function with k basis functions. Our experimental results in Section 9.3 highlight this 
difference by showing an example problem that requires exponentially many leaves in the 
value function, but that can be approximated well using a linear value function. 

The main advantage of tree-based value functions is that their structure is determined 
dynamically during the solution of the MDP. In principle, as the value function representa- 
tion is derived automatically from the model description, this approach requires less insight 
from the user. In problems for which the value function can be well approximated by a rel- 
atively small number of values, this approach provides an excellent solution to the problem. 
Our method of linear value function approximation aims to address what we believe to be 
the more common case, where a large range of distinct values is required to achieve a good 
approximation. 

Finally, we note that Schuurmans and Patrascu (2001), based on our earlier work on 
max-norm projection using cost networks and linear programs, independently developed 
an alternative approach to approximate linear programming using a cost network. Our 
method embeds a cost network inside a single linear program. By contrast, their method 
is based on a constraint generation approach, using a cost network to detect constraint 
violations. When constraint violations are found, a new constraint is added, repeatedly 
generating and attempting to solve LPs until a feasible solution is found. Interestingly, 
as the approach of Schuurmans and Patrascu uses multiple calls to variable elimination in 
order to speed up the LP solution step, it will be most successful when the time spent 
solving the LP is significantly larger than the time required for variable elimination. As 
suggested in Section 9.2, the LP solution time is larger for the table-based approach. Thus, 
Schuurmans and Patrascu's constraint generation method will probably be more successful 
in table-based problems than in rule-based ones. 

11. Conclusions 

In this paper, we presented new algorithms for approximate linear programming and ap- 
proximate dynamic programming (value and policy iteration) for factored MDPs. Both 
of these algorithms leverage on a novel LP decomposition technique, analogous to vari- 
able elimination in cost networks, which reduces an exponentially large LP to a provably 
equivalent, polynomial-sized one. 

Our approximate dynamic programming algorithms are motivated by error analyses 
showing the importance of minimizing error. These algorithms are more efficient and 
substantially easier to implement than previous algorithms based on the £2-projection. Our 
experimental results suggest that they also perform better in practice. 

Our approximate linear programming algorithm for factored MDPs is simpler, easier to 
implement and more general than the dynamic programming approaches. Unlike our policy 
iteration algorithm, it does not rely on the default action assumption, which states that 
actions only affect a small number of state variables. Although this algorithm does not have 
the same theoretical guarantees as max-norm projection approaches, empirically it seems to 
be a favorable option. Our experiments suggest that approximate policy iteration tends to 
generate better policies for the same set of basis functions. However, due to the computa- 
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tional advantages, we can add more basis functions to the approximate linear programming 
algorithm, obtaining a better policy and still maintaining a much faster running time than 
approximate policy iteration. 

Unlike previous approaches, our algorithms can exploit both additive and context- 
specific structure in the factored MDP model. Typical real-world systems possess both 
of these types of structure, thus, this feature of our algorithms will increase the appli- 
cability of factored MDPs to more practical problems. We demonstrated that exploiting 
context-specific independence, by using a rule-based representation instead of the standard 
table-based one, can yield exponential improvements in computational time when the prob- 
lem has significant amounts of CSI. However, the overhead of managing sets of rules make 
it less well-suited for simpler problems. We also compared our approach to the work of 
Boutilier et al. (2000), which exploits only context-specific structure. For problems with 
significant context-specific structure in the value function, their approach can be faster due 
to their efficient handling of the ADD representation. However, there are problems with 
significant context-specific structure in the problem representation, rather than in the value 
function, which require exponentially large ADDs. In some such problems, we demon- 
strated that by using a linear value function our algorithm can obtain a polynomial-time 
near-optimal approximation of the true value function. 

The success of our algorithm depends on our ability to capture the most important 
structure in the value function using a linear, factored approximation. This ability, in turn, 
depends on the choice of the basis functions and on the properties of the domain. The 
algorithms currently require the designer to specify the factored basis functions. This is a 
limitation compared to the algorithms of Boutilier et al. (2000), which are fully automated. 
However, our experiments suggest that a few simple rules can be quite successful for de- 
signing a basis. First, we ensure that the reward function is representable by our basis. A 
simple basis that, in addition, contained a separate set of indicators for each variable often 
did quite well. We can also add indicators over pairs of variables; most simply, we can choose 
these according to the DBN transition model, where an indicator is added between variables 
Xi and each one of the variables in Parents(Xi) , thus representing one-step influences. This 
procedure can be extended, adding more basis functions to represent more influences as 
required. Thus, the structure of the DBN gives us indications of how to choose the basis 
functions. Other sources of prior knowledge can also be included for further specifying the 
basis. 

Nonetheless, a general algorithm for choosing good factored basis functions still does 
not exist. However, there are some potential approaches: First, in problems with CSI, one 
could apply the algorithms of Boutilier et al. for a few iterations to generate partial tree- 
structured solutions. Indicators defined over the variables in backprojection of the leaves 
could, in turn, be used to generate a basis set for such problems. Second, the Bellman 
error computation, which can be performed efficiently as shown in Section 7, does not only 
provide a bound on the quality of the policy, but also the actual state where the error is 
largest. This knowledge can be used to create a mechanism to incrementally increase the 
basis set, adding new basis functions to tackle states with high Bellman error. 

There are many other possible extensions to this work. We have already pursued ex- 
tensions to collaborative multiagent systems, where multiple agents act simultaneously to 
maximize the global reward (Guestrin et al, 2001b), and factored POMDPs, where the 
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full state is not observed directly, but indirectly through observation variables (Guestrin, 
Roller, & Parr, 2001c). However, there are other settings that remain to be explored. In 
particular, we hope to address the problem of learning a factored MDP and planning in a 
competitive multiagent system. 

Additionally, in this paper we have tackled problems where the induced width of the cost 
network is sufficiently low or that possess sufficient context-specific structure to allow for 
the exact solution of our factored LPs. Unfortunately, some practical problems may have 
prohibitively large induced width. We plan to leverage on ideas from loopy belief prop- 
agation algorithms for approximate inference in Bayesian networks (Pearl, 1988; Yedidia, 
Freeman, & Weiss, 2001) to address this issue. 

We believe that the methods described herein significantly further extend the efficiency, 
applicability and general usability of factored models and value functions for the control of 
practical dynamic systems. 
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Appendix A. Proofs 
A.l Proof of Lemma 3.3 

There exists a setting to the weights — the all zero setting — that yields a bounded max- 
norm projection error f3p for any policy ((3p < R m ax)- Our max-norm projection operator 
chooses the set of weights that minimizes the projection error for each policy tt^\ Thus, 
the projection error must be at least as low as the one given by the zero weights (3p 
(which is bounded). Thus, the error remains bounded for all iterations. □ 

A.2 Proof of Theorem 3.5 

First, we need to bound our approximation of V^t)' 
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Moving the second term to the right hand side and dividing through by 1 — 7, we obtain: 
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For the next part of the proof, we adapt a lemma of Bertsekas and Tsitsiklis (1996, Lemma 
6.2, p. 277) to fit into our framework. After some manipulation, this lemma can be refor- 
mulated as: 

II V* - V^+d IL < 7 II V* - V ffW lloo + ^ I Km - HwW L . (25) 

The proof is concluded by substituting Equation (24) into Equation (25) and, finally, in- 
duction on t □ 

A. 3 Proof of Theorem 4.4 

First, note that the equality constraints represent a simple change of variable. Thus, we 
can rewrite Equation (12) in terms of these new LP variables ul\ as: 

0>max^>£, (26) 

i 

where any assignment to the weights w implies an assignment for each ul \ . After this stage, 
we only have LP variables. 

It remains to show that the factored LP construction is equivalent to the constraint in 
Equation (26). For a system with n variables {X±, . . . , X n }, we assume, without loss of 
generality, that variables are eliminated starting from X n down to X\. We now prove the 
equivalence by induction on the number of variables. 

The base case is n = 0, so that the functions q(x) and b(x) in Equation (12) all have 
empty scope. In this case, Equation (26) can be written as: 

<A>E^- ( 27 ) 

i 

In this case, no transformation is done on the constraint, and equivalence is immediate. 

Now, we assume the result holds for systems with i— 1 variables and prove the equivalence 
for a system with i variables. In such a system, the maximization can be decomposed into 
two terms: one with the factors that do not depend on Xj, which are irrelevant to the 
maximization over JQ, and another term with all the factors that depend on Xj. Using this 
decomposition, we can write Equation (26) as: 

d> > max lit? ' 

r i I' : ' 



3 



> max 



E u % + m T ax E «3 



e. 

: Xifai " j : Xiezj 



(28) 



At this point we can define new LP variables u% corresponding to the second term on 
the right hand side of the constraint. These new LP variables must satisfy the following 
constraint: 

i 

<>max5>^ i)[z .,. (29) 
j'=i 
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This new non-linear constraint is again represented in the factored LP construction by a 
set of equivalent linear constraints: 



u. 



(30) 



The equivalence between the non-linear constraint Equation (29) and the set of linear con- 
straints in Equation (30) can be shown by considering binding constraints. For each new 
LP variable created u z , there are |Aj| new constraints created, one for each value Xi of Xi. 
For any assignment to the LP variables in the right hand side of the constraint in Equa- 
tion (30), only one of these |Aj| constraints is relevant. That is, one where J2j=i u ( z x .)[z ■] 
is maximal, which corresponds to the maximum over Aj. Again, if for each value of z more 
than one assignment to Xi achieves the maximum, then any of (and only) the constraints 
corresponding to those maximizing assignments could be binding. Thus, Equation (29) and 
Equation (30) are equivalent. 

Substituting the new LP variables u% into Equation (28), we get: 



> max u' + u 7 , 

~ Xl,...,Xi_l ^ z * 



which does not depend on Xi anymore. Thus, it is equivalent to a system with i — 1 variables, 
concluding the induction step and the proof. □ 

A. 4 Proof of Lemma 7.1 

First note that at iteration t + 1 the objective function cf>( t+1 ^ of the max-norm projection 
LP is given by: 

= |Hw( t+1 ) - (X (t+1) + 7 P^ +1) Hw(*+ 1 ))|| oo . 
However, by convergence the value function estimates are equal for both iterations: 

w (* +1 ) = W W. 

So we have that: 

= |h W W - (i^+i) +7P w( t + i)HwW)| • 
In operator notation, this term is equivalent to: 

^+ 1 ) = |HwW-T w(i+1) HwW 

Note that, 7r(* +1 ) = Greedy(HwW) by definition. Thus, we have that: 

r x(t+ i)HwW = rHwW. 

Finally, substituting into the previous expression, we obtain the result: 



0(*+!)= Hw«-T*Hw« 



□ 
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